# Unit 3 Tokenizing Text Data in NLP with Python and NLTK

# Lesson Overview

Welcome to our new lesson on **Tokenization**. Tokenization is a text preprocessing step commonly performed in **Natural Language Processing** (NLP). It transforms raw text into a more usable format by breaking it down into individual words or other tokens. Our lesson uses Python, **NLTK (the Natural Language Toolkit)**, and the **pandas** library for data handling. We'll apply tokenization to the **SMS Spam Collection dataset** that you're already familiar with. Let's get started!

---

# Understanding Tokenization

Tokenization is the process of converting a sequence of text into separate pieces called tokens, usually words. When we read, our brains automatically pick out words using spaces, punctuation, and other separators, and interpret them in context. For computers, the process isn't that straightforward: text arrives as a plain sequence of characters, so word boundaries have to be identified explicitly, and that's where tokenization comes into play.

Tokenization plays a key role in various NLP tasks, including text classification, language modeling, and sentiment analysis. For instance, if we train a machine learning model to classify spam messages, tokenization helps split a message into individual words. Each word becomes a feature for our model to learn from.
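
To make the idea of words as features concrete, here is a minimal sketch that counts how often each word occurs in a message. It uses plain `str.split()` as a stand-in tokenizer; the sample message and the `word_counts` helper are made up for illustration and are not part of the lesson's dataset code.

```python
from collections import Counter


def word_counts(message: str) -> Counter:
    """Count how often each whitespace-separated word occurs (hypothetical helper)."""
    tokens = message.lower().split()
    return Counter(tokens)


# Made-up sample message for illustration
features = word_counts("Free entry win a free prize")
print(features["free"])   # 2
print(features["prize"])  # 1
```

A model could use counts like these as input features, one per distinct word.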

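One of the challenges with tokenization is handling contractions. A plain whitespace tokenizer keeps "don't" as a single token, while a tokenizer that splits on every punctuation mark breaks it into "don", "'", and "t", which loses the meaning. NLTK's `word_tokenize` instead splits it into "do" and "n't", as the output later in this lesson shows. Depending on the task, we might need additional steps to handle contractions appropriately.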
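
The following sketch compares two naive splitting strategies on a contraction, using only the standard library. The sample text is made up, and the regular expression is one simple way to separate word characters from punctuation, not the rule set NLTK's `word_tokenize` applies.

```python
import re

text = "nah i don't think so"

# Strategy 1: split on whitespace only; the contraction stays in one piece
whitespace_tokens = text.split()

# Strategy 2: split into runs of word characters or runs of punctuation;
# the apostrophe becomes its own token and the contraction falls apart
punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)  # ['nah', 'i', "don't", 'think', 'so']
print(punct_tokens)       # ['nah', 'i', 'don', "'", 't', 'think', 'so']
```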

---

# Exploring the NLTK Library

NLTK, or Natural Language Toolkit, is a Python library that provides tools for handling human language data. It supplies easy-to-use interfaces to over 50 corpora and lexical resources, along with text processing packages such as `nltk.tokenize`, which offers several tokenizer functions including `word_tokenize` and `sent_tokenize`.
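
As a rough illustration of the variety in `nltk.tokenize`, the sketch below runs three of its tokenizers on a made-up sentence. These three were chosen because they are rule-based and should not need any downloaded resources; this assumes NLTK is installed.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    wordpunct_tokenize,
)

text = "Call now, it's free!"

# Splits on whitespace only, so punctuation stays attached to words
print(WhitespaceTokenizer().tokenize(text))
# ['Call', 'now,', "it's", 'free!']

# Splits into runs of word characters and runs of punctuation
print(wordpunct_tokenize(text))
# ['Call', 'now', ',', 'it', "'", 's', 'free', '!']

# Follows Penn Treebank conventions, keeping the clitic "'s" together
print(TreebankWordTokenizer().tokenize(text))
# ['Call', 'now', ',', 'it', "'s", 'free', '!']
```

Which behavior is preferable depends on the task; the rest of this lesson uses `word_tokenize`.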

Before using `word_tokenize` for the first time, you might need to download the `punkt_tab` resource using `nltk.download('punkt_tab', quiet=True)`. The `quiet=True` parameter suppresses output messages during the download. This resource contains the pre-trained Punkt data that NLTK uses to split text into sentences. `word_tokenize` first splits the text into sentences and then splits each sentence into word and punctuation tokens.

```python
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)
```

Downloading `punkt_tab` is necessary because `word_tokenize` relies on the Punkt sentence tokenizer, whose parameters were learned with an unsupervised algorithm, to find sentence boundaries before it separates words from punctuation. Without the resource, `word_tokenize` raises a `LookupError`.
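
If you would rather download only when the resource is missing, one option is to check for it first. This is a sketch under assumptions: `resource_available` and `ensure_punkt` are hypothetical helpers, and the resource path below is assumed to match where your NLTK version looks for the English Punkt data.

```python
import nltk

# Assumed lookup path for the English Punkt data
PUNKT_PATH = "tokenizers/punkt_tab/english/"


def resource_available(resource_path: str) -> bool:
    """Return True if NLTK can locate the resource locally (hypothetical helper)."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False


def ensure_punkt() -> None:
    """Download punkt_tab only when it is not already installed (hypothetical helper)."""
    if not resource_available(PUNKT_PATH):
        # Needs network access the first time it runs
        nltk.download("punkt_tab", quiet=True)


if __name__ == "__main__":
    ensure_punkt()
```

Calling `nltk.download` unconditionally, as the lesson does, is also fine, since NLTK skips packages that are already up to date.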

---

# Implementing Tokenization Using NLTK

As you already know, the SMS Spam Collection dataset can be loaded directly into a pandas DataFrame for convenient handling.

As a first step, let's convert all messages to lowercase to ensure uniformity, because "hello" and "Hello" would otherwise be treated as two different tokens.

```python
# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())
```
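
As an aside, pandas also offers a vectorized string accessor for this step. The sketch below uses a made-up three-row DataFrame (named `sample_df` to keep it separate from the lesson's `df`) to show one practical difference: `.str.lower()` passes missing values through, while calling `x.lower()` inside `apply` would fail on them.

```python
import pandas as pd

# Made-up sample with one missing message
sample_df = pd.DataFrame({"message": ["Hello", "FREE entry", None]})

# .str.lower() lowercases every string and leaves missing values as missing,
# whereas apply(lambda x: x.lower()) would raise an AttributeError on them
sample_df["processed_message"] = sample_df["message"].str.lower()

print(sample_df["processed_message"].tolist()[:2])  # ['hello', 'free entry']
```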

Then we'll implement tokenization using the function `nltk.tokenize.word_tokenize()`.

```python
# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())
```

The output of the above code will be:

```
0    [go, until, jurong, point, ,, crazy, .., avail...
1             [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: tokens, dtype: object
```

This output clearly demonstrates tokenization in action, where each message is split into a list of components or "tokens". This step is critical for preparing text data for further analysis in NLP tasks.
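
The output also shows that punctuation such as `...` becomes a token of its own. One possible follow-up step, which is not part of this lesson's pipeline, is to keep only tokens that contain at least one letter or digit. The sketch below applies that idea to the tokens from the second row of the output.

```python
# Tokens copied from the second row of the output above
tokens = ["ok", "lar", "...", "joking", "wif", "u", "oni", "..."]

# Keep only tokens that contain at least one letter or digit
word_tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

print(word_tokens)       # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
print(len(word_tokens))  # 6
```

Note that a token such as `n't` survives this filter because it contains letters.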

---

# Lesson Summary

Today, we learned about the concept of **tokenization** and its importance in the context of Natural Language Processing. Utilizing the power of the `nltk` library in Python, we explored how tokens, the individual pieces of text, can be extracted from raw text data for further processing. Now, it's your turn to practice and refine your tokenization skills with a series of exercises. Remember, the more you practice, the better you become at working with Natural Language Processing tasks! Happy learning!



## Efficient Text Preprocessing with NLTK

In this practice, we'll tokenize text data from the SMS Spam Collection using Python and NLTK, lowercasing for uniform analysis and splitting messages into tokens. Run the existing code directly to see tokenization in action.

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easy handling
df = pd.DataFrame(sms_spam['train'])

# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())

# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())


```


## Streamlining Text Processing with NLTK

In this exercise, you'll work on a common text preprocessing task: tokenization. You are given a script designed to tokenize SMS text messages, a crucial step for analyzing such data. However, the script fails because a necessary step is missing. Your mission is to identify and restore it so that the messages are tokenized correctly.

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easy handling
df = pd.DataFrame(sms_spam['train'])

# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())

# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())

```

The missing step is the `punkt_tab` download. With it restored, the script runs as intended:

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
# Fix: The 'punkt_tab' package needs to be downloaded for word_tokenize to work.
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easy handling
df = pd.DataFrame(sms_spam['train'])

# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())

# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())

```

## Implementing Tokenization Basics

Having familiarized yourself with the basics of tokenization and explored NLTK's toolkit, now you are tasked with completing a partially written Python script. Fill in the blank to add the missing part of the code to tokenize the messages.

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())

# TODO: Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: _____________(x))
print(df['tokens'].head())

```


```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Convert all messages to lowercase for uniformity
df['processed_message'] = df['message'].apply(lambda x: x.lower())

# Tokenize the messages into individual words
df['tokens'] = df['processed_message'].apply(lambda x: word_tokenize(x))
print(df['tokens'].head())
```


## Mastering Tokenization with NLTK

After delving into the complexities of tokenization and the versatility of the NLTK library, you are now well-equipped to take on this final exercise. This task requires you to manually implement the code for tokenization of the SMS messages. This step is fundamental for any text analysis endeavor in natural language processing (NLP).

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection', split='train')

# Convert to pandas DataFrame for easy handling
df = pd.DataFrame(sms_spam)

# TODO: Tokenize the messages into individual words using NLTK's word_tokenize

# TODO: Print the first 5 entries of the tokens to verify the tokenization process

```

```python
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary packages are downloaded for tokenization
nltk.download('punkt_tab', quiet=True)

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection', split='train')

# Convert to pandas DataFrame for easy handling
df = pd.DataFrame(sms_spam)

# Rename 'sms_message' to 'message' if the column uses that name;
# rename() leaves the DataFrame unchanged when the column is already 'message'
df = df.rename(columns={'sms_message': 'message'})

# Convert messages to lowercase and then tokenize into individual words using NLTK's word_tokenize
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x.lower()))

# Print the first 5 entries of the tokens to verify the tokenization process
print(df['tokens'].head())

```