# Session 1: Data Processing and Tokenization

This session covers fundamental NLP data preprocessing and tokenization techniques.



## Text Preprocessing

Consider an example sentence: `"Hello, World! NLP is amazing."`

In [1]:
### Session 1: Data Processing and Tokenization

# Introduction
# This notebook covers fundamental NLP data preprocessing and tokenization techniques.
# It includes both theoretical explanations and practical exercises.

## Section 1: Understanding Text Preprocessing

### What is Text Preprocessing?
# Text preprocessing is a crucial step in NLP to ensure data consistency, reduce noise, and optimize model performance.

# Common preprocessing steps:
# - Lowercasing
# - Removing punctuation
# - Tokenization (splitting text into smaller units)
# - Removing stopwords (optional, task-dependent)
# - Normalization (e.g., stemming, lemmatization)

# Let's implement some basic text preprocessing:

import re
from typing import List

def basic_preprocessing(text: str) -> str:
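    """Lowercases text and strips everything except ASCII letters, digits, and whitespace."""
    text = text.lower()  # Convert to lowercase
    # Drop punctuation; note this also drops accented and other non-ASCII characters
    text = re.sub(r'[^a-z0-9\s]', '', text)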
    return text

# Example usage:
text_sample = "Hello, World! NLP is amazing."
print("Original:", text_sample)
print("Processed:", basic_preprocessing(text_sample))

## Section 2: Tokenization

### What is Tokenization?
# Tokenization is the process of breaking text into smaller units (tokens).
# These can be words, subwords, or characters.

# Common types of tokenization:
# - Word-based tokenization (splitting on whitespace)
# - Character-level tokenization
# - Subword tokenization (e.g., Byte-Pair Encoding, WordPiece, SentencePiece)

# Let's implement simple word-based and character-level tokenization:

def word_tokenize(text: str) -> List[str]:
    """Splits text into words on runs of whitespace."""
    return text.split()

def char_tokenize(text: str) -> List[str]:
    """Splits text into characters."""
    return list(text)

# Example usage:
processed_text = basic_preprocessing(text_sample)
print("Word Tokens:", word_tokenize(processed_text))
print("Character Tokens:", char_tokenize(processed_text))

## Section 3: Subword Tokenization

### What is Subword Tokenization?
# Subword tokenization helps handle rare words and reduce vocabulary size.
# Examples: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

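def train_bpe_tokenizer(texts: List[str]) -> Tokenizer:
    """Trains a simple BPE tokenizer on a given text corpus."""
    # Tell the model which token stands for unknown symbols;
    # listing "[UNK]" in the trainer's special_tokens alone does not do this.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))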
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
    tokenizer.train_from_iterator(texts, trainer)
    return tokenizer

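# Example usage with a two-sentence toy corpus; swap in a larger one (e.g., Tiny Shakespeare) for real use.
# With so little text, BPE merges each word into a single token, so no subword splits appear in the output.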
dataset = ["This is an example sentence.", "Another sentence goes here."]
bpe_tokenizer = train_bpe_tokenizer(dataset)

# Encoding a sample sentence
encoded = bpe_tokenizer.encode("This is an example sentence.")
print("BPE Tokens:", encoded.tokens)

## Section 4: Preparing a Dataset for Future Use

### Creating a Tokenized Dataset
# We will process a small dataset and tokenize it for use in future sessions.

def prepare_tokenized_dataset(texts: List[str], tokenizer: Tokenizer) -> List[List[str]]:
    """Tokenizes each text and returns its token strings (use `.ids` instead for integer IDs)."""
    return [tokenizer.encode(text).tokens for text in texts]

# Tokenizing our dataset
tokenized_dataset = prepare_tokenized_dataset(dataset, bpe_tokenizer)
print("Tokenized Dataset:", tokenized_dataset)

# Saving the tokenizer for reuse
bpe_tokenizer.save("bpe_tokenizer.json")

# Conclusion: In this session, we covered:
# - Basic text preprocessing
# - Different types of tokenization
# - Implementing word, character, and subword tokenization
# - Tokenizing a small dataset and saving the tokenizer for future use


Original: Hello, World! NLP is amazing.
Processed: hello world nlp is amazing
Word Tokens: ['hello', 'world', 'nlp', 'is', 'amazing']
Character Tokens: ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', ' ', 'n', 'l', 'p', ' ', 'i', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g']
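
The preprocessing steps listed in the cell include stopword removal, but the code does not implement it. The following is a minimal sketch of that step applied to the word tokens printed above. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced here for illustration; the stopword list is hand-picked and far smaller than what a library such as NLTK or spaCy would provide.

```python
from typing import FrozenSet, List

# Hypothetical, hand-picked stopword set for illustration only.
STOPWORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in"}
)

def remove_stopwords(
    tokens: List[str], stopwords: FrozenSet[str] = STOPWORDS
) -> List[str]:
    """Returns the tokens that are not in the stopword set, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]

# Word tokens as produced by basic_preprocessing + word_tokenize on the sample sentence.
tokens = ["hello", "world", "nlp", "is", "amazing"]
filtered = remove_stopwords(tokens)
print("Without stopwords:", filtered)
```

Whether to drop stopwords depends on the task: it can help bag-of-words models, but it discards words that sequence models may rely on.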



BPE Tokens: ['This', 'is', 'an', 'example', 'sentence', '.']
Tokenized Dataset: [['This', 'is', 'an', 'example', 'sentence', '.'], ['Another', 'sentence', 'goes', 'here', '.']]
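
Because the toy corpus is so small, the BPE output above shows only whole words and hides how merges are actually learned. The sketch below illustrates the core idea in plain Python: repeatedly count adjacent symbol pairs and merge the most frequent one. It is a simplified teaching version, not the `tokenizers` library's implementation: it has no special tokens or end-of-word markers, and ties are broken by whichever pair was seen first. The names `count_pairs`, `merge_pair`, `learn_bpe`, and `toy_corpus` are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]  # a word represented as a tuple of symbols

def count_pairs(words: Dict[Word, int]) -> Counter:
    """Counts adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replaces every occurrence of `pair` with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(
    corpus: List[str], num_merges: int
) -> Tuple[List[Tuple[str, str]], Dict[Word, int]]:
    """Learns up to `num_merges` merges from whitespace-split words."""
    # Start with every word split into single characters.
    words = dict(Counter(tuple(w) for text in corpus for w in text.split()))
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

toy_corpus = ["low low low lower lowest", "new newer newest"]
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print("Learned merges:", merges)
print("Segmented words:", segmented)
```

With only a few merges, frequent words such as `low` collapse into one symbol while rarer words stay split into smaller pieces, which is the behavior a larger training corpus would reveal in the library-trained tokenizer as well.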
