# Lesson 4: Text Preprocessing for Deep Learning with TensorFlow


Welcome, data enthusiasts! In this lesson, we will continue our journey into the world of Natural Language Processing (NLP), with an introduction to deep learning for text classification. To harness the power of deep learning, it's important to start with proper data preparation. That's why we will focus today on text preprocessing, shifting from Scikit-learn, which we used previously in this course, to the powerful TensorFlow library.

The goal of this lesson is to leverage TensorFlow for textual data preparation and understand how it differs from the methods we used earlier. We will implement tokenization, convert tokens into sequences, learn how to pad these sequences to a consistent length, and transform categorical labels into integer labels to input into our deep learning model. Let's dive in!

## Understanding TensorFlow and its Role in Text Preprocessing

TensorFlow is an open-source library developed by Google, encompassing a comprehensive ecosystem of tools, libraries, and resources that facilitate machine learning and deep learning tasks, including NLP. As with any machine learning task, preprocessing of your data is a key step in NLP as well.

A significant difference between text preprocessing with TensorFlow and with libraries like Scikit-learn lies in the approach to tokenization and sequence generation. Scikit-learn's text vectorizers turn each document into a vector of word counts or weights, which discards word order. TensorFlow's Keras utilities instead map every word to an integer index and keep the words in their original order, handling both tokenization and sequence generation within the same library. Let's understand how this process works.

## Tokenizing Text Data

Tokenization is a foundational step in NLP, where sentences or texts are segmented into individual words or tokens. This process facilitates the comprehension of the language structure and produces meaningful units of text that serve as input for numerous machine learning algorithms.

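In TensorFlow, we use the `Tokenizer` class for tokenization. By default it lowercases the text and strips punctuation before splitting on spaces, and it assigns integer indices starting at 1, with the most frequent words receiving the lowest indices. A useful feature of this tokenizer is its handling of 'out-of-vocabulary' (OOV) words, that is, words not present in the tokenizer's word index. By specifying the `oov_token` parameter, we can assign a special token, `<OOV>`, to represent these OOV words; it always receives index 1. The `num_words` parameter caps which indices are used when text is later converted to sequences: only indices below `num_words` are kept.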
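Before calling the class, it can help to see roughly what word-level tokenization does to raw text. The following is a minimal pure-Python sketch, not the Keras implementation: `simple_tokenize` is a hypothetical helper, and `FILTERS` is assumed to match the default `filters` argument of the Keras `Tokenizer`.

```python
# Simplified sketch of word-level tokenization (not the Keras implementation).
# FILTERS is assumed to mirror the default `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase the text, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


tokens = simple_tokenize("Love is a powerful entity.")
print(tokens)  # ['love', 'is', 'a', 'powerful', 'entity']
```

The lowercasing and punctuation stripping explain why the word index in the example below contains `love` rather than `Love` and has no entry for the period.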

### Example of Tokenization:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = "Love is a powerful entity."
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts([sentence])
word_index = tokenizer.word_index
print(word_index)
```

#### Output:

```plaintext
{'<OOV>': 1, 'love': 2, 'is': 3, 'a': 4, 'powerful': 5, 'entity': 6}
```

Through this mechanism, TensorFlow's `Tokenizer` effectively prepares text data for subsequent machine learning tasks by mapping words to consistent integer values while gracefully handling words not encountered during the initial vocabulary construction.
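To see where these numbers come from, here is a small pure-Python sketch of how such a word index can be built: count the words, sort them by descending frequency, and number them from 2 upward, with index 1 reserved for the OOV token. `build_word_index` is a hypothetical helper written for illustration, and its tokenization is deliberately crude.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Assign index 1 to the OOV token, then 2, 3, ... to words by descending frequency."""
    counts = Counter()
    for text in texts:
        # Crude tokenization for the sketch: lowercase, replace periods with spaces, split.
        counts.update(text.lower().replace(".", " ").split())
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ordered = sorted(counts, key=lambda word: counts[word], reverse=True)
    return {word: i for i, word in enumerate([oov_token] + ordered, start=1)}


word_index = build_word_index(["Love is a powerful entity."])
print(word_index)
# {'<OOV>': 1, 'love': 2, 'is': 3, 'a': 4, 'powerful': 5, 'entity': 6}
```

Every word in this sentence occurs once, so the indices simply follow the order of first appearance; in a larger corpus the most frequent words would receive the lowest indices.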

## Converting Text to Sequences

After tokenization, the next step is to represent text as sequences of integers. Sequences are lists of integers where each integer corresponds to a token in the dictionary created during tokenization.

### Example of Converting Text to Sequences:

```python
sentences = [sentence, "very powerful"]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
```

#### Output:

```plaintext
[[2, 3, 4, 5, 6], [1, 5]]
```

The word “very” is not found in the tokenizer's word index, thus it is labeled as token `1`, which we designated as the `<OOV>` token. The word “powerful”, being recognized in the vocabulary, retains its assigned index `5`.
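The lookup itself is simple enough to sketch in plain Python. The snippet below is an illustration rather than the Keras code: `to_sequence` is a hypothetical helper, and the word index is copied from the tokenization example above.

```python
# Word index copied from the tokenization example above.
word_index = {"<OOV>": 1, "love": 2, "is": 3, "a": 4, "powerful": 5, "entity": 6}


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Map each lowercase word to its index, falling back to the OOV index."""
    oov_index = word_index[oov_token]
    words = text.lower().replace(".", " ").split()
    return [word_index.get(word, oov_index) for word in words]


print(to_sequence("Love is a powerful entity.", word_index))  # [2, 3, 4, 5, 6]
print(to_sequence("very powerful", word_index))               # [1, 5]
```

Because unknown words are mapped to a fixed index instead of being dropped, the sequence keeps one integer per input word.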

## Padding Sequences for Consistent Input Shape

Deep learning models require input data of a consistent shape. Padding ensures this by adding zeros to shorter sequences to match the length of the longest sequence. The value 0 is safe to use as filler because the tokenizer never assigns index 0 to a word.

### Example of Padding Sequences:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences, padding='post')
print(padded_sequences)
```

#### Output:

```plaintext
[[2 3 4 5 6]
 [1 5 0 0 0]]
```

The padding ensures all sequences have the same length, as deep learning models require. With `padding='post'` the zeros are appended to the end of each sequence; the default, `padding='pre'`, places them at the beginning.
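As a rough illustration of what padding does, the following pure-Python sketch pads a list of sequences to the length of the longest one. `pad` is a hypothetical helper that returns plain lists, whereas `pad_sequences` returns a NumPy array and supports further options not shown here.

```python
def pad(sequences, padding="post", value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # "post" appends the filler, anything else prepends it.
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


print(pad([[2, 3, 4, 5, 6], [1, 5]]))                 # [[2, 3, 4, 5, 6], [1, 5, 0, 0, 0]]
print(pad([[2, 3, 4, 5, 6], [1, 5]], padding="pre"))  # [[2, 3, 4, 5, 6], [0, 0, 0, 1, 5]]
```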

## Implementing Text Preprocessing with TensorFlow

Finally, let's implement the entire preprocessing workflow with a limited set of data from the Reuters-21578 text categorization dataset.

### Full Implementation:

```python
# Import necessary libraries
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import reuters

# Download the reuters dataset from nltk
nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Tokenize the text data, using TensorFlow's Tokenizer class
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)

# Padding sequences for uniform input shape
X = pad_sequences(sequences, padding='post')

# Translating categories into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Shape of X: ", X.shape)
print("Shape of Y: ", y.shape)
```

#### Output:

```plaintext
Shape of X:  (2477, 2380)
Shape of Y:  (2477,)
```
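The label step in the code above is the only part not yet explained: `LabelEncoder` replaces each category name with an integer. A minimal pure-Python sketch of the same idea follows; `encode_labels` is a hypothetical helper, the labels are made up for illustration, and the sorted class order is meant to mirror how `LabelEncoder` orders its classes.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, using sorted order for the classes."""
    classes = sorted(set(labels))
    lookup = {label: i for i, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes


# Made-up labels for illustration; not taken from the Reuters run above.
y_sketch, classes = encode_labels(["grain", "acq", "grain", "alum"])
print(classes)   # ['acq', 'alum', 'grain']
print(y_sketch)  # [2, 0, 2, 1]
```

The result is one integer per document, which is why `y` in the output above has a single dimension while `X` has one row per document and one column per token position.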

## Conclusion

Great work! You've successfully ventured into TensorFlow for text preprocessing, an essential step in leveraging the true potential of deep learning for text classification. You've seen how tokenization, sequence creation, and padding can be swiftly handled in TensorFlow, a key difference from methods we used in Scikit-learn. These foundations will serve you well as we move forward in our NLP journey. Up next, we're diving deeper into building Neural Network Models for Text Classification!



## Adjusting Tokenizer Parameters

Great work so far, Stellar Navigator. Now, let's adjust the parameters of the tokenizer. Initially, the `num_words` parameter of the `Tokenizer` class is set to 10. Change `num_words` to 5, run the given sentence through the tokenizer once more, and observe what happens to `word_index` and how the tokens outside the limit are labeled as `<OOV>` in the sequence.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Original sentence
sentence = "Love is a powerful entity that can change the world."

tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")
tokenizer.fit_on_texts([sentence])
word_index = tokenizer.word_index

# Print word index
print("Updated word index: ", word_index)

# Translating text to sequences
sequence = tokenizer.texts_to_sequences([sentence])
print("Updated sequence: ", sequence)
```

With `num_words` changed to 5, the code from the exercise becomes:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Original sentence
sentence = "Love is a powerful entity that can change the world."

tokenizer = Tokenizer(num_words=5, oov_token="<OOV>")
tokenizer.fit_on_texts([sentence])
word_index = tokenizer.word_index

# Print word index
print("Updated word index: ", word_index)

# Translating text to sequences
sequence = tokenizer.texts_to_sequences([sentence])
print("Updated sequence: ", sequence)
```

### Output:
```plaintext
Updated word index:  {'<OOV>': 1, 'love': 2, 'is': 3, 'a': 4, 'powerful': 5, 'entity': 6, 'that': 7, 'can': 8, 'change': 9, 'the': 10, 'world': 11}
Updated sequence:  [[2, 3, 4, 1, 1, 1, 1, 1, 1, 1]]
```

In this example, `tokenizer.fit_on_texts([...])` examines the text it receives and builds a vocabulary from every unique word it finds, so `word_index` still lists all of them regardless of `num_words`. The `num_words=5` restriction is applied only when text is converted to sequences: only indices below 5 are kept (index 1 for `<OOV>` plus the first three words in the index), and every other word, including `powerful` at index 5, is replaced with the `<OOV>` index 1.
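The cutoff rule can be sketched in a few lines of plain Python. This is an illustration of the behaviour rather than the Keras source: `to_sequence` is a hypothetical helper, and the rule it encodes is that an index is kept only if it is strictly below `num_words`.

```python
# Full word index for "Love is a powerful entity that can change the world."
word_index = {"<OOV>": 1, "love": 2, "is": 3, "a": 4, "powerful": 5, "entity": 6,
              "that": 7, "can": 8, "change": 9, "the": 10, "world": 11}


def to_sequence(words, word_index, num_words=None, oov_token="<OOV>"):
    """Look up each word; indices at or above num_words fall back to the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in words:
        index = word_index.get(word, oov_index)
        if num_words is not None and index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence


words = "love is a powerful entity that can change the world".split()
print(to_sequence(words, word_index, num_words=5))   # [2, 3, 4, 1, 1, 1, 1, 1, 1, 1]
print(to_sequence(words, word_index, num_words=10))  # [2, 3, 4, 5, 6, 7, 8, 9, 1, 1]
```

Note that even with `num_words=10` the last two words are replaced, because their indices (10 and 11) are not below the limit.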





## Tokenizer Text Processing Practice

Alright, Stellar Navigator! Your challenge is to modify our Tokenizer to tokenize a different text. Fill in the missing parts (____) to process the new text sample. Use 10 for the number of words and "<OOV>" for out-of-vocabulary words. Remember to fit the tokenizer to the sample and generate a sequence from it. Additionally, print the sequence and the word_index to see the results of your processing.

```python
# Import necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer

# Define text sample
text_sample = "The quick brown fox jumps over the lazy dog."

# Define the tokenizer
tokenizer = Tokenizer(num_words=____, oov_token=____)

# Fit the tokenizer on the text sample
tokenizer.____([text_sample])

# Retrieve the word index
word_index = tokenizer.word_index

# Convert the text to a sequence
sequence = tokenizer.____([text_sample])

# Print the sequence and word index
print(f"Sequence: {sequence}")
print(f"Word Index: {word_index}")

```

Here's your updated code with the missing parts filled in:  

```python
# Import necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer

# Define text sample
text_sample = "The quick brown fox jumps over the lazy dog."

# Define the tokenizer
tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")

# Fit the tokenizer on the text sample
tokenizer.fit_on_texts([text_sample])

# Retrieve the word index
word_index = tokenizer.word_index

# Convert the text to a sequence
sequence = tokenizer.texts_to_sequences([text_sample])

# Print the sequence and word index
print(f"Sequence: {sequence}")
print(f"Word Index: {word_index}")
```

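This tokenizes the sentence with `num_words=10` and `<OOV>` as the out-of-vocabulary token. The sample contains only eight distinct words, so every index falls below the limit and nothing is replaced with `<OOV>`; `the` appears twice and therefore gets the lowest word index, 2.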

## Filling the Gaps in Text Preprocessing Code

Fantastic! Now, Stellar Navigator, let's fill in the code at the marked spots (____)! We are focusing on some key parts of our text preprocessing pipeline. In particular, you need to correctly categorize the data, specify the number of words for tokenization, and ensure the text data is properly tokenized. Happy coding!

```python
# Import necessary libraries
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import reuters

# Download the reuters dataset from nltk
nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.____(fileid)[0] for fileid in documents]

# Tokenize the text data, using TensorFlow's Tokenizer class
tokenizer = ____(num_words=____, oov_token="<OOV>")
tokenizer.____(text_data)
sequences = tokenizer.____(text_data)

# Padding sequences for uniform input shape
X = pad_sequences(sequences, padding='post')

# Translating categories into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Shape of X: ", X.shape)
print("Shape of Y: ", y.shape)

```

Here’s your completed code with the missing parts filled in:  

```python
# Import necessary libraries
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import reuters

# Download the reuters dataset from nltk
nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Tokenize the text data, using TensorFlow's Tokenizer class
tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)

# Padding sequences for uniform input shape
X = pad_sequences(sequences, padding='post')

# Translating categories into numerical labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Shape of X: ", X.shape)
print("Shape of Y: ", y.shape)
```

### 🔍 Breakdown of Fixes:
1. `categories_data = [reuters.categories(fileid)[0] for fileid in documents]`  
   - Takes the first category listed for each document (a Reuters document can carry several).
2. `tokenizer = Tokenizer(num_words=500, oov_token="<OOV>")`  
   - Initializes the Tokenizer so that only indices below 500 are used in sequences; all other words map to `<OOV>`.
3. `tokenizer.fit_on_texts(text_data)`  
   - Builds the word index from the dataset.
4. `sequences = tokenizer.texts_to_sequences(text_data)`  
   - Converts each document into a sequence of integer indices.

Now, your text preprocessing pipeline is complete! 🚀

## Initiating the Tokenizer Process

Alright, Stellar Navigator! Your next task is to initialize the tokenizer. The text sample and the fit function are already provided. You merely need to fill in the line for the tokenizer. Use 10 for the number of words and "<OOV>" for out-of-vocabulary words. After fitting the tokenizer, print the word index and sequence to see the results of your tokenizer.

```python
# Import necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer

# Define text sample
text_sample = "The quick brown fox jumps over the lazy dog."

# TODO: Initialize the tokenizer

# Fit the tokenizer on the text sample
tokenizer.fit_on_texts([text_sample])

# Get word index and sequence
word_index = tokenizer.word_index
print("Word Index: ", word_index)

sequence = tokenizer.texts_to_sequences([text_sample])
print("Sequence: ", sequence)

```

Here’s the completed code with the tokenizer properly initialized:

```python
# Import necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer

# Define text sample
text_sample = "The quick brown fox jumps over the lazy dog."

# Initialize the tokenizer
tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")

# Fit the tokenizer on the text sample
tokenizer.fit_on_texts([text_sample])

# Get word index and sequence
word_index = tokenizer.word_index
print("Word Index: ", word_index)

sequence = tokenizer.texts_to_sequences([text_sample])
print("Sequence: ", sequence)
```

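### ✅ What's Implemented?
1. **Tokenizer Initialization:**
   - `num_words=10`: When text is converted to sequences, keeps only word indices below 10 (index 1 is `<OOV>`, so at most the 8 most frequent words); `word_index` itself still lists every word.
   - `oov_token="<OOV>"`: Maps out-of-vocabulary words to index 1.

2. **Tokenization Process:**
   - `fit_on_texts([text_sample])`: Counts word frequencies and builds the word index.
   - `texts_to_sequences([text_sample])`: Converts text to integer indices.

### 🔍 Expected Output:
- The word index maps every distinct word to an integer, most frequent first: `{'<OOV>': 1, 'the': 2, 'quick': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9}`.
- The sequence is the integer representation of the input text: `[[2, 3, 4, 5, 6, 7, 2, 8, 9]]`.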

Give it a run and see how it tokenizes the sentence! 🚀🔥

## Tokenizing Text Data with TensorFlow

Impressive progress, Stellar Navigator! Now, harness the power of TensorFlow and tokenize unique text data of your choice from scratch. After implementing the process of tokenization, print the word index and sequences to verify your implementation!

```python
from tensorflow.keras.preprocessing.text import Tokenizer

text_data = "The sky is blue"

# TODO: Initialize tokenizer with num_words and oov_token parameters

# TODO: Fit tokenizer on your text data

# TODO: Generate word index and print it

# TODO: Convert text data into a sequence and print it
```

Here’s the complete implementation with tokenization from scratch:  

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Define unique text data
text_data = "The sky is blue"

# Initialize tokenizer with num_words and oov_token parameters
tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")  # Keeps indices below 10; unknown words map to <OOV>

# Fit tokenizer on your text data
tokenizer.fit_on_texts([text_data])

# Generate word index and print it
word_index = tokenizer.word_index
print("Word Index:", word_index)  # Shows how words are indexed

# Convert text data into a sequence and print it
sequence = tokenizer.texts_to_sequences([text_data])
print("Sequence:", sequence)  # Outputs the integer sequence for the sentence
```

### ✅ What This Does:
1. **Tokenizer Initialization:**
   - `num_words=10`: Keeps only word indices below 10 when text is converted to sequences; `word_index` still lists every word.
   - `oov_token="<OOV>"`: Maps unknown words to index 1.

2. **Tokenization Process:**
   - `fit_on_texts([text_data])`: Counts word frequencies and builds the word index.
   - `texts_to_sequences([text_data])`: Converts words to integer indices.

### 🔍 Expected Output:
- **Word Index:** A dictionary mapping each word to a unique number: `{'<OOV>': 1, 'the': 2, 'sky': 3, 'is': 4, 'blue': 5}`.
- **Sequence:** A list of numbers representing the words in the sentence: `[[2, 3, 4, 5]]`.

Try running it and see how it tokenizes your text! 🚀🔥