# Lesson 1: Tokenization: The Gateway to Text Classification


Hello and welcome! Today's lesson introduces a crucial component of text feature engineering: tokenization. In text classification, tokenization is a pre-processing step that splits raw text into smaller units of meaning known as tokens. Breaking text into these pieces gives machine learning models input they can actually work with. Our goal in this lesson is to apply tokenization to a raw text dataset (a corpus of IMDB movie reviews) and understand how it supports text classification.

## Understanding the Concept and Importance of Text Tokenization

Text tokenization is a type of pre-processing step where a text string is split up into individual units (tokens). In most cases, these tokens are words, digits, or punctuation marks. For instance, consider this text: "I love Python." After tokenization, this sentence is split into `['I', 'love', 'Python', '.']`, with each word and punctuation mark becoming a separate token.
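
To make the idea concrete, here is a minimal sketch of a tokenizer built with Python's standard `re` module. The helper name `simple_tokenize` and its regular expression are illustrative assumptions, not the algorithm NLTK uses; real tokenizers handle many more cases, such as contractions and abbreviations.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only; it is not NLTK's tokenizer.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love Python.")
print(tokens)  # ['I', 'love', 'Python', '.']
```

Even this simple rule separates the final period from the word it is attached to, which is the behavior described above.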

Text tokenization plays a foundational role in text classification and many other Natural Language Processing (NLP) tasks. Most machine learning algorithms require numerical input, so we can't feed raw text into them directly. This is where tokenization steps in. It breaks the text into individual tokens, which can then be converted into a numerical form (via techniques like Bag-of-Words, TF-IDF, etc.). The machine learning algorithms then operate on this numerical representation.
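
As a rough sketch of how tokens become numbers, the snippet below counts token occurrences with `collections.Counter` and lays the counts out as a vector, which is the basic idea behind Bag-of-Words. The token list is a made-up example, and dedicated libraries build such vectors differently in practice.

```python
from collections import Counter

# Hypothetical token list, as a tokenizer might produce for a short sentence
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Count how often each token occurs
counts = Counter(tokens)

# Fix an ordering of the vocabulary so every text maps to the same positions
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)      # [1, 1, 1, 1, 2]
```

Each position in `vector` corresponds to one vocabulary word, so the text is now a list of numbers that an algorithm can consume.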

## Applying Tokenization on a Text Example Using NLTK

Before we tackle our dataset, let's understand how tokenization works with a simple example. Python and the NLTK (Natural Language Toolkit) library, a comprehensive library built specifically for NLP tasks, make tokenization simple and efficient. For our example, suppose we have a sentence: "The cat is on the mat." Let's tokenize it:

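```python
import nltk
from nltk import word_tokenize
# word_tokenize needs the Punkt data ('punkt_tab' on newer NLTK releases)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
text = "The cat is on the mat."
tokens = word_tokenize(text)
print(tokens)
```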

The output of the above code will be:

```
['The', 'cat', 'is', 'on', 'the', 'mat', '.']
```
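
For contrast, here is a small sketch of what happens if we split the same sentence on whitespace with plain `str.split()` instead of using a tokenizer. It only illustrates why a dedicated tokenizer is useful; it is not part of the NLTK workflow.

```python
text = "The cat is on the mat."

# Splitting on whitespace leaves punctuation attached to the neighboring word
whitespace_tokens = text.split()

print(whitespace_tokens)      # ['The', 'cat', 'is', 'on', 'the', 'mat.']
print(whitespace_tokens[-1])  # mat.
```

With whitespace splitting, `'mat.'` and `'mat'` would be treated as two different tokens, whereas the tokenized output above keeps `'mat'` and `'.'` apart.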

## Text Classification Dataset Overview

For the purpose of this lesson, we'll use the `movie_reviews` corpus that ships with NLTK, a collection of IMDB movie reviews along with their associated binary sentiment polarity labels. The corpus contains 2,000 reviews: 1,000 positive and 1,000 negative. (It is much smaller than the 50,000-review IMDB dataset, split into 25k training and 25k test reviews, that is often used as a sentiment benchmark.) For these lessons, we will focus on the first 100 reviews. Because `fileids()` lists the negative reviews first, these 100 reviews all carry the `neg` label.

It's important to note that the IMDB dataset provided in the NLTK corpus has been preprocessed. The text is already lowercased, and common punctuation is typically separated from the words. This pre-cleaning makes the dataset well-suited for the tokenization process we'll be exploring.

Let's get these reviews and print a few of them:

```python
import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

movie_reviews_ids = movie_reviews.fileids()[:100]
review_texts = [movie_reviews.raw(fileid) for fileid in movie_reviews_ids]
print("First movie review:\n", review_texts[0][:260])
```

Note that we're only printing the first 260 characters of the first review to prevent lengthy output.

## Applying Tokenization on the Dataset

Now it's time to transform our data. For this, we will apply tokenization on all our 100 movie reviews.

```python
from nltk import word_tokenize
tokenized_reviews = [word_tokenize(review) for review in review_texts]
```

So, what did tokenization change in our data? Each review, which was initially one long string of text, is now a list of individual tokens (words, punctuation, etc.) that together represent the review. In other words, our dataset went from being a list of strings to being a list of lists.
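
The change in shape can be sketched with two made-up review strings. Here `str.split()` stands in for a tokenizer purely to keep the example self-contained; the names `sample_texts` and `sample_tokenized` are hypothetical.

```python
# Hypothetical mini-dataset: punctuation is already separated by spaces
sample_texts = ["a fun , clever film .", "the plot drags ."]

# Stand-in tokenizer: whitespace splitting is enough for these pre-separated strings
sample_tokenized = [text.split() for text in sample_texts]

print(type(sample_texts[0]).__name__)      # str
print(type(sample_tokenized[0]).__name__)  # list
print([len(tokens) for tokens in sample_tokenized])  # [6, 4]
```

The outer list still has one entry per review, but each entry is now a list of tokens rather than a single string.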

```python
for i, review in enumerate(tokenized_reviews[:3]):
    print(f"\n Review {i+1} first 10 tokens:\n", review[:10])
```

The output of the above code will be:

```

 Review 1 first 10 tokens:
 ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party']

 Review 2 first 10 tokens:
 ['the', 'happy', 'bastard', "'s", 'quick', 'movie', 'review', 'damn', 'that', 'y2k']

 Review 3 first 10 tokens:
 ['it', 'is', 'movies', 'like', 'these', 'that', 'make', 'a', 'jaded', 'movie']
```

## Lesson Summary and Next Steps

Well done! Today, you learned about the fundamental concept of text tokenization and its importance in text classification. You also applied tokenization to the IMDB movie reviews dataset using Python and NLTK. Your text data is now effectively transformed into a form that machine learning models can digest better.

As you advance in the course, you will refine this dataset further for your text classification objectives. We are laying the foundation one brick at a time, and tokenization was a sturdy one! Upcoming lessons will build upon this understanding. You'll harness this tokenized data to generate Bag-of-Words representations, implement TF-IDF representations, handle sparse features, and apply dimensionality reduction.

Remember, practice consolidates learning. Reinforce your knowledge by running the code samples and applying these concepts in context. Don't hesitate to get creative: modify the code and see how the output changes. Happy learning!

## Filter Punctuation from Tokenized Review

Good job, Space Voyager! The existing code is intended to tokenize a movie review and filter out punctuation, leaving only word tokens. However, there's a bug causing it to behave incorrectly.

Can you locate the error and adjust the code to correctly execute the task? After your revision, the code should tokenize the first movie review from the dataset, omit any punctuation, and print the first 10 word tokens.

```python
import nltk
from nltk.corpus import movie_reviews
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)

# Load the first movie review
review_text = movie_reviews.raw(movie_reviews.fileids()[0])

# Apply tokenization on the review
tokenized_review = word_tokenize(review_text)

# Should keep only word tokens
current_tokens = [token for token in tokenized_review if not token.isalpha()]

# Print the first 10 tokens
print("\nThe first 10 tokens are:\n", current_tokens[:10])


```

The issue in the provided code lies in the list comprehension used to filter out non-word tokens. The condition used (`if not token.isalpha()`) is incorrect because it keeps tokens that are not alphabetic, which are typically punctuation. To correct this, you should change the condition to keep only alphabetic tokens by using `if token.isalpha()`.

Here is the corrected version of the code:

```python
import nltk
from nltk.corpus import movie_reviews
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)

# Load the first movie review
review_text = movie_reviews.raw(movie_reviews.fileids()[0])

# Apply tokenization on the review
tokenized_review = word_tokenize(review_text)

# Should keep only word tokens
current_tokens = [token for token in tokenized_review if token.isalpha()]

# Print the first 10 tokens
print("\nThe first 10 tokens are:\n", current_tokens[:10])
```

This adjustment ensures that the list `current_tokens` will only include tokens that consist of alphabetic characters, effectively filtering out punctuation and other non-word elements. The print statement will then display the first 10 word tokens of the first movie review.
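
To see exactly what `isalpha()` accepts, the sketch below applies it to a small hand-picked token list. Most of these tokens appear in the review outputs earlier in this lesson; `'10'` is added as an assumption to show a purely numeric token.

```python
# Hand-picked tokens: words, punctuation, a possessive marker, and tokens with digits
tokens = ['plot', ':', 'two', "'s", 'y2k', 'church', '10']

# Show the result of isalpha() for each token
for token in tokens:
    print(f"{token!r:<8} isalpha -> {token.isalpha()}")

# Keep only purely alphabetic tokens
word_tokens = [token for token in tokens if token.isalpha()]
print(word_tokens)  # ['plot', 'two', 'church']
```

Note that `isalpha()` is stricter than "not punctuation": it also drops tokens such as `'y2k'` and `"'s"`, because every character must be a letter.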

## Filtering Word Tokens from a Sentence

Hey Space Voyager, you're doing great!

In the provided starter code, fill in the blanks (____) to split a given sentence into individual tokens using the nltk library. Then keep only the word tokens, discarding the punctuation marks.

```python
from nltk import ____

# Given text
text = "Who doesn't enjoy studying Feature Engineering for Text Classification?"

# Tokenize the text
tokens = ____(text)

# Filter out the words
words = [token for token in tokens if token.____()]

print("\n Word Tokens:\n", words)

```

To tokenize the given text and keep only the word tokens, import the `word_tokenize` function from `nltk` (it is also available as `nltk.tokenize.word_tokenize`). Then keep each token for which the `isalpha()` method returns `True`. Here's how to fill in the blanks in your code:

```python
from nltk import word_tokenize  # Import the word_tokenize function

# Given text
text = "Who doesn't enjoy studying Feature Engineering for Text Classification?"

# Tokenize the text
tokens = word_tokenize(text)  # Use word_tokenize to tokenize the text

# Keep only the word tokens
words = [token for token in tokens if token.isalpha()]  # isalpha() is True only if every character is a letter

print("\n Word Tokens:\n", words)
```

### Explanation:
- **`from nltk import word_tokenize`**: This line imports the `word_tokenize` function, which is used to split the string into individual tokens.
- **`tokens = word_tokenize(text)`**: This line tokenizes the input text into individual elements, including words and punctuation.
- **`words = [token for token in tokens if token.isalpha()]`**: This list comprehension keeps only the tokens that are purely alphabetic, dropping punctuation such as `?`. Note that `word_tokenize` splits the contraction "doesn't" into `does` and `n't`: `does` is kept, while `n't` is dropped because it contains an apostrophe.

When you run this code, it will print out only the alphabetic word tokens from the given sentence, omitting punctuation and non-alphabetic characters.
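
The effect of the filter on contractions can be sketched without running the tokenizer. The token list below is hard-coded to mirror what `word_tokenize` typically returns for this sentence (treat it as an assumption), and the "lenient" filter is an illustrative alternative, not part of the exercise.

```python
# Assumed tokenizer output for the exercise sentence (hard-coded for illustration)
tokens = ['Who', 'does', "n't", 'enjoy', 'studying', 'Feature',
          'Engineering', 'for', 'Text', 'Classification', '?']

# Strict filter from the exercise: every character must be a letter
strict_words = [token for token in tokens if token.isalpha()]

# Hypothetical lenient filter: keep tokens containing at least one letter
lenient_words = [token for token in tokens if any(ch.isalpha() for ch in token)]

print(strict_words)
print(lenient_words)
print(set(lenient_words) - set(strict_words))  # {"n't"}
```

The strict filter loses the negation carried by `n't`, which can matter for sentiment tasks; whether to keep such tokens is a design choice that depends on the downstream model.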

## Completing Code for Data Loading and Tokenizing

Great job so far, Space Voyager!

Your task now is to fill in the blanks with the correct code to load the IMDB movie reviews dataset. After loading the data, tokenize the first movie review, and print the first 10 tokens of the tokenized review.

```python
import nltk
from ____ import ____
from nltk import word_tokenize

nltk.____('movie_reviews', quiet=True)

# Load the first movie review
review_text = ____.raw(____.fileids()[0])

# Apply tokenization on the review
tokenized_review = word_tokenize(review_text)

# Print the first 10 tokens
print("\nThe first 10 tokens are:\n", tokenized_review[:10])

```

To complete the provided code snippet for loading the IMDB movie reviews dataset, tokenizing the first movie review, and printing the first 10 tokens, you need to fill in the blanks appropriately with the correct imports and function calls. Here’s the filled-out version of the code:

```python
import nltk
from nltk.corpus import movie_reviews  # Import the movie_reviews dataset
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)  # Ensure the movie_reviews dataset is downloaded

# Load the first movie review
review_text = movie_reviews.raw(movie_reviews.fileids()[0])  # Access the raw text of the first review

# Apply tokenization on the review
tokenized_review = word_tokenize(review_text)  # Tokenize the text of the first review

# Print the first 10 tokens
print("\nThe first 10 tokens are:\n", tokenized_review[:10])
```

### Explanation:
- **`from nltk.corpus import movie_reviews`**: This line imports the `movie_reviews` dataset from the `nltk.corpus` module. `movie_reviews` is a built-in dataset within NLTK that contains a collection of movie reviews.
- **`nltk.download('movie_reviews', quiet=True)`**: This function call ensures that the `movie_reviews` dataset is downloaded and available for use. The `quiet=True` parameter suppresses the output to keep the console clean unless an error occurs.
- **`review_text = movie_reviews.raw(movie_reviews.fileids()[0])`**: Here, `movie_reviews.fileids()[0]` gets the file ID of the first movie review. The `raw()` method retrieves the full raw text of that review.
- **`tokenized_review = word_tokenize(review_text)`**: This function call tokenizes the retrieved text of the first movie review into individual tokens, which include words and punctuation.
- **`print("\nThe first 10 tokens are:\n", tokenized_review[:10])`**: Finally, this line prints the first 10 tokens of the tokenized review to the console.

When you run this code, it will display the first 10 tokens of the first movie review in the IMDB dataset, providing a glimpse into how the text data is structured at the token level.

## Tokenizing and Filtering a Movie Review

Hail, Stellar Navigator! In this exercise, you'll need to tokenize the first movie review from our dataset, filter out any punctuation, and print the first 10 word tokens. Oh, and don't forget to include the sentiment of the review. Let's add the missing pieces to our script and complete this task!

```python
import nltk
from nltk.corpus import movie_reviews 
from nltk import word_tokenize 

nltk.download('movie_reviews', quiet=True)

# Fetch the first movie review and its sentiment label
first_review_id = movie_reviews.fileids()[0]
review_text = movie_reviews.raw(first_review_id) 
review_sentiment = first_review_id.split('/')[0] 

# TODO: Apply tokenization on the review 

# TODO: Filter out the punctuation, keeping only word tokens 

# Print the sentiment label and the first 10 word tokens 
print(f"\nSentiment of the first review: {review_sentiment}")
print(f"\nFirst 10 word tokens of the first review:\n", ____[:10]) 

```

To complete this script, you'll need to tokenize the movie review text, filter out any punctuation from the tokens, and then print the first 10 word tokens along with the sentiment of the review. Here's how you can fill in the missing parts of the script:

```python
import nltk
from nltk.corpus import movie_reviews
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)

# Fetch the first movie review and its sentiment label
first_review_id = movie_reviews.fileids()[0]
review_text = movie_reviews.raw(first_review_id)
review_sentiment = first_review_id.split('/')[0]  # Extract sentiment from file ID

# Apply tokenization on the review
tokens = word_tokenize(review_text)

# Filter out the punctuation, keeping only word tokens
word_tokens = [token for token in tokens if token.isalpha()]

# Print the sentiment label and the first 10 word tokens
print(f"\nSentiment of the first review: {review_sentiment}")
print(f"\nFirst 10 word tokens of the first review:\n", word_tokens[:10])
```

### Explanation:
- **Tokenization**: The `word_tokenize(review_text)` function tokenizes the entire content of the review, splitting it into individual elements like words and punctuation.
- **Filtering Punctuation**: The list comprehension `[token for token in tokens if token.isalpha()]` filters out any tokens that contain non-alphabetic characters (i.e., punctuation and numbers), keeping only word tokens.
- **Sentiment Extraction**: The sentiment label (`pos` for positive or `neg` for negative) is extracted from the file ID (`first_review_id`). The ID format typically includes the sentiment as the first part of the path (e.g., "pos/xxxxx.txt" or "neg/xxxxx.txt"), so splitting by '/' and taking the first item gives the sentiment.
- **Printing Results**: The script prints the sentiment of the review followed by the first 10 word tokens of the review.

This script effectively processes the first movie review in the dataset, providing insights into its sentiment and content through tokenization and filtering.
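
The label extraction step can be tried on its own. The sketch below uses made-up file IDs in the same `label/filename` format (they are not read from the corpus) and tallies the labels with `collections.Counter`.

```python
from collections import Counter

# Hypothetical file IDs following the "label/filename" pattern described above
file_ids = ["neg/example_001.txt", "neg/example_002.txt", "pos/example_001.txt"]

# The sentiment label is the part before the first '/'
labels = [file_id.split('/', 1)[0] for file_id in file_ids]
label_counts = Counter(labels)

print(labels)        # ['neg', 'neg', 'pos']
print(label_counts)  # Counter({'neg': 2, 'pos': 1})
```

Applying the same split to every real file ID would give one label per review, ready to pair with the tokenized text.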

## Tokenizing First Review and Printing Tokens

Keep it up, Space Voyager! Now, write a script to load the movie_reviews dataset, find the index (file ID) of the first movie review, get that review, tokenize it, and finally print out the first 10 tokens.

```python
import nltk
from nltk.corpus import movie_reviews
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)

# TODO: Get the index of the first movie review

# TODO: Load the first movie review using its index

# TODO: Apply tokenization on the review

# TODO: Print the first 10 tokens

```

To complete the script that loads the `movie_reviews` dataset, retrieves the first movie review, tokenizes it, and prints the first 10 tokens, follow these steps:

1. **Get the index of the first movie review**: The movie reviews in the `movie_reviews` corpus are accessed via file IDs rather than numeric indices, so the "index" here is the first file ID, which you can retrieve directly.
2. **Load the review using its index**: Pass that file ID to `raw()` to get the raw text of the review.
3. **Tokenize the review**: Use the `word_tokenize` function to break the text into tokens.
4. **Print the first 10 tokens**: Display the initial tokens to check the tokenization.

Here's how you can implement this:

```python
import nltk
from nltk.corpus import movie_reviews
from nltk import word_tokenize

nltk.download('movie_reviews', quiet=True)

# Get the index of the first movie review
first_review_id = movie_reviews.fileids()[0]  # File ID of the first review

# Load the first movie review using its index
review_text = movie_reviews.raw(first_review_id)  # Raw text of the first review

# Apply tokenization on the review
tokens = word_tokenize(review_text)  # Tokenize the text

# Print the first 10 tokens
print("\nThe first 10 tokens are:\n", tokens[:10])
```

### Explanation:
- **`first_review_id = movie_reviews.fileids()[0]`**: This retrieves the ID of the first review. The `fileids()` method lists all the file IDs in the dataset, and `[0]` accesses the first one.
- **`review_text = movie_reviews.raw(first_review_id)`**: This loads the raw text of the review identified by `first_review_id`.
- **`tokens = word_tokenize(review_text)`**: This line tokenizes the review text into words and punctuation.
- **`print("\nThe first 10 tokens are:\n", tokens[:10])`**: This prints the first 10 tokens of the tokenized review, giving you an initial look at the content.

This script effectively demonstrates how to access and process text data from the `movie_reviews` dataset, making it a useful example for learning text processing with Python and NLTK.