# Unit 1

## Introduction to Tokenization (Rule-Based Tokenization)

Welcome to the first lesson of our course on **Modern Tokenization Techniques for AI & LLMs**. In this lesson, we will explore the concept of **tokenization**, a fundamental step in **Natural Language Processing** (NLP). Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, sentences, or even characters, depending on the level of granularity required.

Tokenization is crucial because it transforms raw text into a format that can be easily processed by AI models, enabling them to understand and generate human language. Additionally, tokenization helps in reducing the complexity of text data, making it easier to analyze and manipulate. It is the first step in many NLP pipelines, serving as the foundation for tasks such as parsing, part-of-speech tagging, and named entity recognition.
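To make the idea of granularity concrete, here is a minimal sketch that tokenizes one string at three levels using only built-in string methods. The sample text and variable names are illustrative assumptions, and the splitting rules are deliberately crude; the library functions covered later in this lesson handle far more cases.

```python
# Illustrative sample text (not from the lesson's examples).
text = "Tokens matter. Models read them."

# Character level: every character, including spaces and periods, is a token.
char_tokens = list(text)

# Word level: split on whitespace; punctuation stays attached to the words.
word_tokens = text.split()

# Sentence level: split on periods and drop empty or whitespace-only pieces.
sentence_tokens = [s.strip() for s in text.split(".") if s.strip()]

print(len(char_tokens), "character tokens")
print(word_tokens)
print(sentence_tokens)
```

The same text yields 32 character tokens, 5 word tokens, or 2 sentence tokens, which is why the level of granularity has to be chosen to fit the task.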

## Recall: Python Libraries for NLP

Before we dive into tokenization techniques, let's briefly recall the importance of Python libraries in NLP. Libraries like NLTK (Natural Language Toolkit) and spaCy provide powerful tools for text processing, making complex tasks like tokenization more manageable. While we have touched on these libraries before, it's important to remember that they offer pre-built functions that save time and effort, allowing us to focus on building and refining our models. These libraries also come with extensive documentation and community support, which can be invaluable when troubleshooting or seeking to extend their functionality. Furthermore, they are optimized for performance, enabling efficient processing of large datasets, which is essential when working with LLMs.

## Understanding Rule-Based Tokenization

Rule-based tokenization involves using predefined rules to split text into tokens. This method is straightforward and effective for many applications. Unlike statistical or machine learning-based tokenization, rule-based tokenization relies on patterns such as spaces, punctuation, or regular expressions to identify token boundaries. While it is fast and easy to implement, it may not handle all edge cases, such as contractions or special characters, as effectively as more advanced methods. Rule-based tokenization is often used in scenarios where the text structure is predictable and consistent, such as processing log files or structured documents. However, it may require manual adjustments to handle language-specific nuances or domain-specific jargon.
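As a rough illustration of what "predefined rules" means in practice, the sketch below applies two simple rules to the same sentence using only Python's standard library. The sample sentence and variable names are assumptions chosen to expose edge cases; this is not how NLTK or spaCy work internally.

```python
import re

# Illustrative sentence with a title, a possessive, a hyphen, a time, and a contraction.
text = "Dr. Smith's log-in failed at 10:45, didn't it?"

# Rule 1: split on whitespace only. Punctuation stays glued to neighboring words.
whitespace_tokens = text.split()

# Rule 2: a token is a run of word characters, or any single character
# that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The first rule leaves tokens such as `'10:45,'` and `'it?'` with punctuation attached, while the second breaks `"didn't"` into `'didn'`, `"'"`, and `'t'`. Neither result is obviously right, which is the kind of edge case that pushes real tokenizers toward longer rule sets.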

## NLTK Tokenization Techniques

Let's explore how to perform tokenization using **NLTK**, a popular library for NLP tasks. NLTK provides a variety of tokenization methods, each suited for different types of text and analysis needs. It is widely used in academic research and educational settings due to its comprehensive suite of tools and ease of use.

### Word Tokenization with NLTK

First, we'll use NLTK's `word_tokenize` function to split a sentence into individual words.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

# Sample text with two sentences
txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year."

# NLTK word tokenization
nltk_word_tokens = word_tokenize(txt)

print("NLTK Word Tokenization:", nltk_word_tokens)
```

**Explanation:**

  - We import the necessary functions from NLTK and download the `punkt_tab` resource, which `word_tokenize` relies on.
  - The `word_tokenize` function splits the text into words, handling punctuation and special characters.
  - The output is a list of words: `['Dr.', 'John', "O'Reilly", '’', 's', 'AI-based', 'startup', 'raised', '$', '10M', 'in', '2023', '.', 'The', 'company', 'plans', 'to', 'expand', 'globally', 'next', 'year', '.']`.
  - This method is particularly useful for tasks that require word-level analysis, such as sentiment analysis or word frequency counting.

### Sentence Tokenization with NLTK

Next, we'll use `sent_tokenize` to split text into sentences.

```python
from nltk.tokenize import sent_tokenize

txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year."

# NLTK sentence tokenization
nltk_sentence_tokens = sent_tokenize(txt)

print("NLTK Sentence Tokenization:", nltk_sentence_tokens)
```

**Explanation:**

  - The `sent_tokenize` function divides the text into sentences.
  - The output is a list containing the sentences: `["Dr. John O'Reilly’s AI-based startup raised $10M in 2023.", "The company plans to expand globally next year."]`
  - Sentence tokenization is crucial for tasks that require understanding the context or flow of information, such as summarization or translation.
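To see why sentence splitting needs more than "split after a period", here is a minimal standard-library sketch. The regular expression, the `ABBREVIATIONS` set, and the `split_sentences` helper are hypothetical and far from exhaustive; they only illustrate the kind of rule a sentence tokenizer has to encode and are not NLTK's actual algorithm.

```python
import re

txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", txt)

# Hypothetical abbreviation list (an assumption, not exhaustive).
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def split_sentences(text):
    """Split like the naive rule, but re-join pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in re.split(r"(?<=[.!?])\s+", text):
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate  # not a real boundary; keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

print(naive_sentences)       # 3 pieces: "Dr." is wrongly split off
print(split_sentences(txt))  # 2 sentences
```

The naive rule produces three pieces because it treats the period in "Dr." as a sentence boundary; adding the abbreviation rule recovers the two sentences shown above.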

### Regex-Based Tokenization with NLTK

Finally, we'll use `regexp_tokenize` to tokenize text based on a regular expression pattern.

```python
from nltk.tokenize import regexp_tokenize

txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year."

# NLTK regex-based tokenization
nltk_regex_tokens = regexp_tokenize(txt, pattern=r'\w+|\$[\d\.]+|\S')

print("NLTK Regex Tokenization:", nltk_regex_tokens)
```

**Explanation:**

  - The `regexp_tokenize` function uses a regular expression to define token boundaries.
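  - The pattern `r'\w+|\$[\d\.]+|\S'` matches runs of word characters (`\w+`), a dollar sign followed by digits and dots (`\$[\d\.]+`), or any other single non-whitespace character (`\S`).
  - The output is a list of tokens: `['Dr', '.', 'John', 'O', "'", 'Reilly', '’', 's', 'AI', '-', 'based', 'startup', 'raised', '$10', 'M', 'in', '2023', '.', 'The', 'company', 'plans', 'to', 'expand', 'globally', 'next', 'year', '.']`. Note that `$10M` comes out as `'$10'` and `'M'`: `[\d\.]+` accepts only digits and dots, so the trailing `M` is matched separately by `\w+`.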
  - Regex-based tokenization offers flexibility and precision, allowing customization for specific tokenization needs, such as extracting dates, numbers, or specific patterns.
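The following sketch shows how small changes to the pattern change the tokens. It uses the standard library's `re.findall`, on the assumption that `regexp_tokenize` with its default settings returns the same matches for a pattern without capturing groups. The sample sentence and the extended pattern are illustrative assumptions, not part of the lesson's code.

```python
import re

# Illustrative sentence (an assumption, not the lesson's sample text).
txt = "The startup raised $10M in 2023 and $2.5K in grants."

# Pattern from the lesson: the dollar alternative covers only digits and dots.
original_pattern = r"\w+|\$[\d.]+|\S"

# Hypothetical extension: allow one decimal part and an optional M or K suffix.
extended_pattern = r"\w+|\$\d+(?:\.\d+)?[MK]?|\S"

original_tokens = re.findall(original_pattern, txt)
extended_tokens = re.findall(extended_pattern, txt)

print(original_tokens)
print(extended_tokens)
```

With the original pattern the amounts come out as `'$10'`, `'M'` and `'$2.5'`, `'K'`; the extended pattern keeps `'$10M'` and `'$2.5K'` whole. In both patterns `\S` sits last so that it acts as a catch-all only when the more specific alternatives fail.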

## spaCy Tokenization

Now, let's briefly see how **spaCy** handles tokenization. spaCy is known for its speed and efficiency in processing large volumes of text. It is designed for production use and offers a range of features beyond tokenization, such as part-of-speech tagging, dependency parsing, and named entity recognition.

```python
import spacy

txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year."

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# spaCy tokenization
doc = nlp(txt)
spacy_tokens = [token.text for token in doc]

print("spaCy Tokenization:", spacy_tokens)
```

**Explanation:**

  - We load the spaCy model `en_core_web_sm`, which is a small English model.
  - Calling the `nlp` pipeline object on the text returns a `Doc`, and we extract each token's text using a list comprehension.
  - The output is a list of tokens: `['Dr.', 'John', 'O', "'", 'Reilly', '’s', 'AI', '-', 'based', 'startup', 'raised', '$', '10M', 'in', '2023', '.', 'The', 'company', 'plans', 'to', 'expand', 'globally', 'next', 'year', '.']`.
  - spaCy's tokenization is highly efficient and can handle large datasets quickly, making it suitable for real-time applications.

## Comparing NLTK and spaCy Tokenization

Let's compare how NLTK and spaCy handle the tokenization of "O'Reilly’s". The two libraries agree on most of the sentence, but they treat the apostrophes differently: NLTK keeps `O'Reilly` whole and splits the curly-apostrophe possessive into two tokens, while spaCy splits at the straight apostrophe and keeps `’s` together:

  - **NLTK Tokens:** `["O'Reilly", '’', 's']`
  - **spaCy Tokens:** `['O', "'", 'Reilly', '’s']`

The choice between NLTK and spaCy depends on the specific requirements of your project, such as speed, accuracy, and ease of use. NLTK is often preferred for educational purposes and research, while spaCy is favored in industry settings for its performance and additional NLP capabilities.
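A quick way to spot every difference between two tokenizers, not just the one around "O'Reilly’s", is to compare their token lists directly. The sketch below hard-codes the two lists exactly as they appear in the outputs earlier in this lesson rather than recomputing them, so it runs without either library installed; the variable names are illustrative.

```python
# Token lists copied from the NLTK and spaCy outputs shown earlier in this lesson.
nltk_tokens = ['Dr.', 'John', "O'Reilly", '’', 's', 'AI-based', 'startup', 'raised',
               '$', '10M', 'in', '2023', '.', 'The', 'company', 'plans', 'to',
               'expand', 'globally', 'next', 'year', '.']
spacy_tokens = ['Dr.', 'John', 'O', "'", 'Reilly', '’s', 'AI', '-', 'based', 'startup',
                'raised', '$', '10M', 'in', '2023', '.', 'The', 'company', 'plans',
                'to', 'expand', 'globally', 'next', 'year', '.']

# Tokens that appear in one list but not in the other, in original order.
only_nltk = [t for t in nltk_tokens if t not in spacy_tokens]
only_spacy = [t for t in spacy_tokens if t not in nltk_tokens]

print("Token counts:", len(nltk_tokens), "vs", len(spacy_tokens))
print("Only in NLTK: ", only_nltk)
print("Only in spaCy:", only_spacy)
```

Besides the name, the comparison surfaces a second difference: NLTK keeps `AI-based` as one token, while spaCy splits it into `AI`, `-`, and `based`.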

## Summary and Preparation for Practice

In this lesson, we introduced the concept of tokenization and explored rule-based tokenization techniques using NLTK and spaCy. We learned how to tokenize text into words and sentences and compared the outputs of both libraries. As you move on to the practice exercises, focus on applying these techniques to different text samples and observe how tokenization affects the structure and meaning of the text. This foundational knowledge will be crucial as we delve deeper into data processing for LLMs in future lessons. Understanding the nuances of tokenization will also help you make informed decisions when selecting or designing tokenization strategies for specific NLP tasks.

## Tokenize Text with NLTK

You've done well exploring tokenization techniques! Now, let's put your skills to the test. Your task is to:

  - Use NLTK's `word_tokenize` function to break down a text paragraph into words.
  - Count the total number of tokens generated.

This exercise will deepen your understanding of tokenization. Dive in and see how NLTK handles different text elements!

```python
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample text
txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. It’s amazing how quickly they’ve grown! They’re planning to expand to Europe, Asia, and beyond. Isn’t it exciting?"

# TODO: Use NLTK's word_tokenize function to tokenize the text

# TODO: Count the total number of tokens
```

```python
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample text
txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. It’s amazing how quickly they’ve grown! They’re planning to expand to Europe, Asia, and beyond. Isn’t it exciting?"

# Use NLTK's word_tokenize function to tokenize the text
nltk_word_tokens = word_tokenize(txt)

# Count the total number of tokens
token_count = len(nltk_word_tokens)

print("NLTK Word Tokenization:", nltk_word_tokens)
print("Total number of tokens:", token_count)
```

## Sentence Tokenization with NLTK

Nice job exploring tokenization techniques! Now, let's apply what you've learned. Your task is to:

  - Use NLTK's `sent_tokenize` function to split a complex paragraph into sentences.
  - Print the resulting sentences.
  - Compare the number of sentences identified by NLTK with what a human might consider distinct sentences.

This exercise will help you understand how NLTK handles tricky sentence boundaries. Give it a try and see how well it performs!

```python
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample complex text
txt = "Dr. Smith graduated from the University. He earned his Ph.D. in 2010! Can you believe it? 'Yes,' she replied. 'It's true.'"

# TODO: Use sent_tokenize to split the text into sentences

# TODO: Compare the number of sentences identified by NLTK with human perception
print("Number of sentences a human might consider:", 5)  # Example human count
```

```python
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample complex text
txt = "Dr. Smith graduated from the University. He earned his Ph.D. in 2010! Can you believe it? 'Yes,' she replied. 'It's true.'"

# Use sent_tokenize to split the text into sentences
nltk_sentences = sent_tokenize(txt)

# Compare the number of sentences identified by NLTK with human perception
print("NLTK identified the following sentences:")
for i, sentence in enumerate(nltk_sentences):
    print(f"{i + 1}. {sentence}")

print("\nNumber of sentences identified by NLTK:", len(nltk_sentences))
print("Number of sentences a human might consider:", 5)

# NLTK's sent_tokenize handles cases like "Dr." and "Ph.D." correctly by not splitting them,
# and it also correctly identifies sentence boundaries after punctuation like '!', '?', and '.', even within quotes.
# This results in a count that aligns well with human perception for this specific text.
```

## Extract Monetary Values with Regex

You've done a great job learning about tokenization! Now, let's dive into a practical task. Your objective is to:

  - Modify the regular expression pattern in the `regexp_tokenize` function to extract monetary values from a text.
  - Identify and extract amounts like "$10", "$10.5M", "€20", etc.
  - Ensure that other numerical data is ignored.

To help you get started, here's a brief guide on how to construct the regex pattern:

  - Use `[\$€]` to match the currency symbols `$` or `€`.
  - Follow it with `\d+` to match one or more digits.
  - Use `(?:\.\d+)?` to optionally match a decimal point followed by one or more digits. The `?:` makes it a non-capturing group.
  - Add `[MK]?` to optionally match the letters `M` or `K`, which often denote million or thousand.

This exercise will enhance your skills in customizing tokenization for specific information extraction. Let's see how well you can tailor the regex pattern!

```python
import nltk
from nltk.tokenize import regexp_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample text with various information
txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year. They also received €20 million from investors and have a 5% growth rate. Additionally, they secured $100K from a local grant. Contact: john@example.com"

# TODO: Modify the regex pattern to extract monetary values
pattern = _____

# TODO: Use NLTK's regexp_tokenize function with the regex pattern to extract monetary values from the text
```

```python
import nltk
from nltk.tokenize import regexp_tokenize
nltk.download('punkt_tab', quiet=True)

# Sample text with various information
txt = "Dr. John O'Reilly’s AI-based startup raised $10M in 2023. The company plans to expand globally next year. They also received €20 million from investors and have a 5% growth rate. Additionally, they secured $100K from a local grant. Contact: john@example.com"

# Regex pattern for monetary values: currency symbol, digits,
# optional decimal part, optional M or K suffix
pattern = r'[\$€]\d+(?:\.\d+)?[MK]?'

# Use NLTK's regexp_tokenize function with the regex pattern to extract monetary values from the text
monetary_tokens = regexp_tokenize(txt, pattern)

print("Extracted Monetary Values:", monetary_tokens)
```
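Before running a pattern like this on real text, it can help to check it against a handful of small cases. The sketch below does that with the standard library's `re.findall`, assuming it yields the same matches as `regexp_tokenize` for this pattern (which has no capturing groups). The test strings and expected values are illustrative, based on the amounts named in the task description.

```python
import re

pattern = r'[\$€]\d+(?:\.\d+)?[MK]?'

# Illustrative inputs mapped to the matches we expect from the pattern.
cases = {
    "$10": ["$10"],
    "$10.5M": ["$10.5M"],
    "€20 million": ["€20"],   # the spelled-out word "million" is not captured
    "$100K grant": ["$100K"],
    "in 2023": [],            # plain year: no currency symbol, so ignored
    "5% growth rate": [],     # percentage: ignored
}

results = {text: re.findall(pattern, text) for text in cases}

for text, found in results.items():
    status = "ok" if found == cases[text] else "MISMATCH"
    print(f"{text!r:20} -> {found} [{status}]")
```

Every case passes, including the ones that must produce no match. Note that "€20 million" yields only `€20`: capturing spelled-out magnitudes would need an additional rule.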

## Tokenization Showdown with NLTK and spaCy

You've done well exploring NLTK's tokenization techniques! Now, let's expand your skills by focusing on spaCy. Your task is to:

Tokenize a specialized text using spaCy.
Identify how spaCy handles contractions, hyphenated words, and special characters.
This exercise will deepen your understanding of spaCy's tokenization approach. Dive in and see how this library handles text!

```python
import spacy

# Sample text
txt = "OpenAI's GPT-4 is an AI-powered tool that won't disappoint."

# spaCy tokenization
# TODO: Load the spaCy model and tokenize the text
```

```python
import spacy

# Sample text
txt = "OpenAI's GPT-4 is an AI-powered tool that won't disappoint."

# spaCy tokenization
# Load the spaCy English model. Make sure you have it installed:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Process the text to create a Doc object
doc = nlp(txt)

# Iterate over the tokens and print their text
print("spaCy Tokens:")
for token in doc:
    print(token.text)
```

### Explanation of spaCy's Tokenization

The output of the code should look like this (exact tokens can vary slightly between spaCy versions):

```
spaCy Tokens:
OpenAI
's
GPT-4
is
an
AI
-
powered
tool
that
wo
n't
disappoint
.
```

Here's a breakdown of how spaCy's tokenization handles the specific elements in the sample text:

1.  **Contractions:** spaCy's default behavior is to split contractions like `"won't"` into separate tokens for the root word and the contraction part. In this case, `"won't"` is tokenized as `"wo"` and `"n't"`. This is a linguistically informed decision that separates the verb stem from the negation token.

2.  **Hyphenated Words:** spaCy's default English rules split on a hyphen between letters, so `"AI-powered"` becomes three tokens: `"AI"`, `"-"`, and `"powered"`. This matches how `"AI-based"` was split into `'AI'`, `'-'`, `'based'` in the spaCy example earlier in this lesson. If you need compound adjectives kept whole, you have to customize the tokenizer's infix rules or merge the tokens afterward.

3.  **Special Characters and Punctuation:**

      * The possessive `'s` in `"OpenAI's"` is tokenized as a separate token (`'s`). This is a standard linguistic tokenization practice that separates the base noun from the possessive marker.
      * The letter-number combination `"GPT-4"` is kept as a single token, because the default hyphen rule applies only when the hyphen sits between two letters.
      * The period at the end of the sentence (`.`) is treated as its own token.

