# Session 1 – Data Preprocessing and Tokenization

In this session, we will learn about fundamental techniques for **data preprocessing** and **tokenization** – two of the most important steps in preparing text data for language model training. 

## Table of Contents

1. [Introduction and Overview](#introduction)
1. [Data Cleaning](#data-cleaning)
1. [Text Normalization](#text-normalization)
1. [Basic Tokenization Concepts](#basic-tokenization)
    1. [Word-Level Tokenization](#word-level-tokenization)
    1. [Character-Level Tokenization](#character-level-tokenization)
    1. [Subword Tokenization](#subword-tokenization)
1. [Using Hugging Face’s Tokenizers Library](#hf-tokenizers)
1. [Putting It All Together - Processing a Small Dataset](#dataset-processing)
1. [Conclusion](#conclusion)

Each section includes a detailed explanation, and each major concept is followed by a hands-on exercise for you to try. Let's get started!


# 1. Introduction and Overview
<a id="introduction"></a>

Before we can train language models to process and understand text, we need to ensure our data is properly **cleaned**, **normalized**, and **tokenized**. 

1. **Data Cleaning**: Removing unwanted characters, dealing with punctuation, whitespace, etc.  
2. **Text Normalization**: Converting text into a canonical or standard form.  
3. **Tokenization**: Splitting text into meaningful units called tokens (characters, words, or subwords).  

By the end of this session, you should be able to:
- Understand why data preprocessing is critical for NLP tasks.
- Implement and customize various tokenization methods.
- Use libraries like **Hugging Face’s `tokenizers`** to train advanced tokenizers.

Let's jump right into the details!

# 2. Data Cleaning
<a id="data-cleaning"></a>

Data cleaning is often the very first step. It's about **removing or transforming unwanted elements** in text, such as:
- HTML tags or XML markup
- Extra whitespace
- Numeric or special characters (if they are not meaningful for your task)
- Stopwords (like "the", "a", "and") in some contexts
- URLs and email addresses

**Key intuition**: The goal is to reduce the “noise” in the text while retaining the parts that are valuable for your model. Sometimes, "noise" may be crucial for tasks like named entity recognition, so always be mindful of your end-goals.
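
Stopword removal appears in the list above but is not covered by the cleaning code in this session. The sketch below shows one minimal way to do it. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced for illustration, and the word list is hand-picked rather than taken from any library.

```python
# Hypothetical, hand-picked stopword list (an assumption for illustration);
# real projects usually rely on a curated list instead.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from whitespace-separated text (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords("The cat and the dog went to a park"))
# cat dog went park
```

The list deliberately leaves out negations such as "not": removing them would change the meaning of a sentence, which is one case where apparent noise matters for the end goal.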

Below is an example of common cleaning steps:

1. **Removing or replacing URLs**: `www.example.com`, `http://...`, etc.
2. **Removing special characters/emojis**: e.g., `😊`, `😢`, etc. (But if emojis matter to your problem, keep them!)
3. **Removing multiple consecutive whitespaces**.
4. **Case folding**: converting text to lowercase (often helps for text classification or language modeling).

### Code Example: Basic Data Cleaning

When you run the code below, you should see that the function:
- Removes the URL.
- Strips out HTML tags.
- Removes punctuation and special characters.
- Converts text to lowercase.
- Collapses consecutive spaces.

In [2]:
import re

def basic_data_cleaning(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\.\S+", "", text)
    
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    
    # Remove non-alphanumeric characters (keeping spaces)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    
    return text

# Test the function
sample_text = """Hello World! Visit us at http://example.com.  
                 <b>Thank</b> you! 123 :-)
              """
cleaned_text = basic_data_cleaning(sample_text)
print("Original:", sample_text)
print("Cleaned: ", cleaned_text)

Original: Hello World! Visit us at http://example.com.  
                 <b>Thank</b> you! 123 :-)
              
Cleaned:  hello world visit us at thank you 123


### Exercise
**Goal**: Write a custom cleaning function that handles **emails**, **numbers**, and **emojis** differently.  
1. Replace each email address with a single placeholder token, like `EMAILTOKEN`.  
2. Replace all numeric values with `NUMTOKEN`.  
3. Convert emojis into a textual representation (e.g., `😊` → `EMOJI_SMILE`).  
4. Test your function on a few sentences containing these elements.  

*Hint*: You can use Python’s **`re`** module for pattern matching and substitution.  
Try to carefully decide which patterns you want to remove and which you want to keep as placeholders (like `EMAILTOKEN`). 

In [34]:
EMAILTOKEN = r"<|EMAIL|>"
NUMTOKEN = r"<|NUMTOKEN|>"
EMOJITOKEN = r"<|EMOJITOKEN|>"
    
def process_emails(text, email_token):
    """Replaces all emails with the 'email_token'"""
    text = re.sub(r"[\w\d\.+-]+?@[\w\d\.-]+?\.[\w\d\.]+", email_token, text)
    return text

def process_numbers(text, number_token):
    """Replaces all numbers with the 'number_token'"""
    text = re.sub(r"[\d]+", number_token, text)
    return text

def process_emojis(text, emoji_token=None):
    """Replaces emojis with either a single token or named token.

    Emoji API is provided by 'emoji' package, which has 'demojize' and 'replace_emoji' functions.

    If 'emoji_token' is None, we will use the emoji name provided by 'emoji' package.
    """
    import emoji
    if emoji_token is None:
        text = emoji.demojize(text, delimiters=("<|", "|>"))
    else:
        # Use the token passed by the caller rather than the global constant
        text = emoji.replace_emoji(text, replace=emoji_token)
    return text

def process(text):
    text = process_emails(text, EMAILTOKEN)
    text = process_numbers(text, NUMTOKEN)
    text = process_emojis(text, EMOJITOKEN)
    return text
    


# Additional test cases
tests = (
    # 1. No special tokens
    ("No special tokens here", "No special tokens here"),
    
    # 2. Multiple emails
    (
        "My emails: foo@example.com, bar@example.org; write to me anytime!",
        f"My emails: {EMAILTOKEN}, {EMAILTOKEN}; write to me anytime!"
    ),
    
    # 3. Complex email formats
    (
        "Please email me at some.email+tag@my-company.co.uk or admin@co.jp",
        f"Please email me at {EMAILTOKEN} or {EMAILTOKEN}"
    ),
    
    # 4. Multiple numbers
    (
        "Numbers: 123, 2023, 42, and 007.",
        f"Numbers: {NUMTOKEN}, {NUMTOKEN}, {NUMTOKEN}, and {NUMTOKEN}."
    ),
    
    # 5. Phone number or numeric strings with punctuation
    (
        "Call me at 123-456-7890!",
        f"Call me at {NUMTOKEN}-{NUMTOKEN}-{NUMTOKEN}!"
    ),
    
    # 6. Multiple emojis
    (
        "I love 🍕 and also this one 👍!",
        f"I love {EMOJITOKEN} and also this one {EMOJITOKEN}!"
    ),
    
    # 7. Mixed scenario
    (
        "Email: me+123@foo.bar, number: 999, emoji: 🤗",
        f"Email: {EMAILTOKEN}, number: {NUMTOKEN}, emoji: {EMOJITOKEN}"
    )
)


failed_samples = []
for idx, (text, expected) in enumerate(tests):
    output = process(text)
    if output != expected:
        print("x", end="")
        failed_samples.append([idx, text, output, expected])
    else:
        print(".", end="")
print()

if failed_samples:
    print("Failed samples:")
    for idx, inp, outp, exp in failed_samples:
        print(f"Test {idx}:")
        print(f"  Input:    {inp}")
        print(f"  Expected: {exp}")
        print(f"  Got:      {outp}")
else:
    print("All tests passed!")

.......
All tests passed!


# 3. Text Normalization
<a id="text-normalization"></a>

Text normalization is about **converting text into a standard (canonical) form**. This includes:
- **Case folding** (already shown above, converting text to lowercase).
- **Stemming**: Reducing words to their "stem" or root form (e.g., `walking` → `walk`, `studies` → `studi`).
- **Lemmatization**: More advanced than stemming. Uses the vocabulary and morphological analysis to get the canonical form of a word (e.g., `walking` → `walk`, `studies` → `study`).

**Stemming** is often simpler and faster but less accurate than **lemmatization**. Lemmatizers require dictionaries or morphological analyzers to convert a word to its correct lemma.
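
To make the idea of a "stem" concrete, here is a toy suffix-stripping stemmer. It is a minimal sketch under stated assumptions, not the Porter algorithm: the `SUFFIX_RULES` list, the `toy_stem` function, and the `min_stem_len` parameter are all hypothetical and chosen only for illustration.

```python
# Toy rule list (an assumption for illustration, not the Porter algorithm):
# each rule is (suffix to strip, replacement), tried in order.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule, keeping at least min_stem_len characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["walking", "walked", "walks", "studies", "was", "sing"]:
    print(w, "->", toy_stem(w))
# walking -> walk
# walked -> walk
# walks -> walk
# studies -> studi
# was -> was
# sing -> sing
```

Because the rules only look at surface suffixes, `toy_stem("running")` returns `runn`. Real stemmers add many more rules to handle cases like this, and lemmatizers avoid the problem by consulting a dictionary.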

### Example: Using NLTK for Stemming and Lemmatization

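Below, notice how **stemming** cuts words more aggressively (`running` → `run`), while **lemmatization** relies on vocabulary and morphological knowledge to find the base form. Called without a POS tag, `WordNetLemmatizer` treats every token as a noun, which is why `running` stays unchanged and `was` becomes `wa` in the output.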

In [4]:
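import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize

# One-time setup: word_tokenize and WordNetLemmatizer need NLTK data packages
# (tokenizer models and the WordNet corpus), fetched with nltk.download(...).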

porter_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at the same time. He has bad habit of swimming after playing long hours in the Sun."

tokens = word_tokenize(sentence.lower())
stems = [porter_stemmer.stem(token) for token in tokens]
lemmas = [lemmatizer.lemmatize(token) for token in tokens]

print("Original Tokens:", tokens)
print("Stemmed Tokens: ", stems)
print("Lemmatized Tokens:", lemmas)

Original Tokens: ['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', '.', 'he', 'has', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hours', 'in', 'the', 'sun', '.']
Stemmed Tokens:  ['he', 'wa', 'run', 'and', 'eat', 'at', 'the', 'same', 'time', '.', 'he', 'ha', 'bad', 'habit', 'of', 'swim', 'after', 'play', 'long', 'hour', 'in', 'the', 'sun', '.']
Lemmatized Tokens: ['he', 'wa', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', '.', 'he', 'ha', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'sun', '.']


### Exercise
**Goal**:  
- Experiment with different stemmers in NLTK (e.g., `SnowballStemmer`) and compare the outputs.  
- Create a custom lemmatization pipeline for your text.  
  - Try to handle different parts of speech (POS). NLTK’s `WordNetLemmatizer` can use POS tags to correctly lemmatize words (e.g., `lemmatizer.lemmatize(word, pos='v')` for verbs).  
- Observe how the results differ depending on which approach you use.  

**Hint**: You can get POS tags using `nltk.pos_tag(tokens)`, then map NLTK’s POS tags to WordNet’s POS notation (like `nltk.corpus.wordnet.VERB`). 

In [39]:
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet

def nltk_pos_to_wordnet_pos(nltk_pos_tag):
    """Map the NLTK part-of-speech tags to WordNet's part-of-speech format.
    """
    if nltk_pos_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_pos_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_pos_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def custom_lemmatizer(text):
    """
    Tokenize, POS-tag, and then lemmatize each token with WordNetLemmatizer,
    using the mapped POS tags for better accuracy.
    """
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)  # [('running', 'VBG'), ('dogs', 'NNS'), ...]

    lemmatized_tokens = []
    for token, pos in pos_tags:
        wn_pos = nltk_pos_to_wordnet_pos(pos)
        if wn_pos: 
            lemmatized_tokens.append(lemmatizer.lemmatize(token, wn_pos))
        else:
            # No matching WordNet POS: fall back to the default (noun) lemmatization
            lemmatized_tokens.append(lemmatizer.lemmatize(token))
    return lemmatized_tokens

# Example text
text = "He was running and eating at the same time. He has bad habits of swimming and studies!"
    
# 1. Compare two stemmers:
porter = PorterStemmer()
snowball = SnowballStemmer("english")
tokens = word_tokenize(text.lower())

porter_stems = [porter.stem(token) for token in tokens]
snowball_stems = [snowball.stem(token) for token in tokens]
lemmatized = custom_lemmatizer(text.lower())

print("Original Tokens     :", tokens)
print("PorterStemmer       :", porter_stems)
print("SnowballStemmer     :", snowball_stems)
print("Custom Lemmatization:", lemmatized)

Original Tokens     : ['he', 'was', 'running', 'and', 'eating', 'at', 'the', 'same', 'time', '.', 'he', 'has', 'bad', 'habits', 'of', 'swimming', 'and', 'studies', '!']
PorterStemmer       : ['he', 'wa', 'run', 'and', 'eat', 'at', 'the', 'same', 'time', '.', 'he', 'ha', 'bad', 'habit', 'of', 'swim', 'and', 'studi', '!']
SnowballStemmer     : ['he', 'was', 'run', 'and', 'eat', 'at', 'the', 'same', 'time', '.', 'he', 'has', 'bad', 'habit', 'of', 'swim', 'and', 'studi', '!']
Custom Lemmatization: ['he', 'be', 'run', 'and', 'eat', 'at', 'the', 'same', 'time', '.', 'he', 'have', 'bad', 'habit', 'of', 'swimming', 'and', 'study', '!']


### Observations
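* PorterStemmer strips endings most aggressively: it reduces `was` to `wa` and `has` to `ha`.
* SnowballStemmer (English) leaves `was` and `has` intact but otherwise matches Porter on this sentence.
* Lemmatization with POS tags produces real words (`was` → `be`, `has` → `have`, `studies` → `study`). It still depends on the tagger: `swimming` is kept as-is here, which suggests it was tagged as a noun.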

# 4. Basic Tokenization Concepts
<a id="basic-tokenization"></a>

Tokenization is the process of **splitting text into smaller units** called *tokens*. The granularity of tokens can be:
- **Word-level**: e.g., "Hello World" → ["Hello", "World"]
- **Character-level**: e.g., "Hello" → ["H", "e", "l", "l", "o"]
- **Subword-level**: e.g., "tokenization" → ["to", "ken", "ization"]

### Key Intuition:
1. **Word-level** tokenization is intuitive but can suffer with out-of-vocabulary (OOV) words.  
2. **Character-level** tokenization rarely has OOV issues but might lead to longer sequences to process.  
3. **Subword-level** tokenization (like Byte Pair Encoding – BPE, WordPiece, etc.) strikes a balance between word- and character-level.  

We will explore each approach in more detail.

## 4.1. Word-Level Tokenization
<a id="word-level-tokenization"></a>

### Concept & Example

Word-level tokenization splits text on spaces and punctuation; depending on the tokenizer design, punctuation is either discarded or kept as separate tokens. For simple English text, you can use **Python’s `split()`** or **libraries like NLTK’s `word_tokenize`**.

**Why use word-level tokenization?**  
- It's straightforward and often sufficient for many tasks.
- It’s a good starting point for simple classification or bag-of-words models.

**Drawbacks**:  
- Struggles with morphological variants (`walk`, `walks`, `walked`, `walking`) – they all become different tokens.  
- Large vocabulary size, leading to OOV issues and less efficient training.
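
The OOV problem is easy to reproduce with a tiny word-level vocabulary. The sketch below is illustrative only: `build_vocab`, `encode`, and the `<unk>` token name are hypothetical choices, not part of any library used in this session.

```python
def build_vocab(sentences, unk_token="<unk>"):
    """Map each distinct whitespace-separated word to an integer ID."""
    vocab = {unk_token: 0}
    for sentence in sentences:
        for word in sentence.split():
            # setdefault only assigns a new ID the first time a word is seen
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, unk_token="<unk>"):
    """Look up each word; unseen words fall back to the unknown-token ID."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]

train = ["i walk home", "she walks home"]
vocab = build_vocab(train)
print(vocab)
# {'<unk>': 0, 'i': 1, 'walk': 2, 'home': 3, 'she': 4, 'walks': 5}
print(encode("he walked home", vocab))
# [0, 0, 3]
```

`walk` and `walks` receive unrelated IDs, and `walked` collapses to `<unk>` even though it is closely related to both. These are the two drawbacks listed above.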

### Example: Word-Level Tokenization with Python

**Note**: The naive tokenizer below deletes punctuation entirely, whereas NLTK's more sophisticated tokenizer keeps each punctuation mark as a separate token.

In [6]:
import re

def word_tokenize_simple(text):
    # Remove extra spaces and punctuation in a naive way
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    return tokens

sample_sentence = "Hello, world! This is a test sentence."
print("Naive Word Tokenization:", word_tokenize_simple(sample_sentence))

# Using NLTK for more sophisticated splitting:
import nltk
nltk_tokens = nltk.word_tokenize(sample_sentence)
print("NLTK Word Tokenization:", nltk_tokens)

Naive Word Tokenization: ['Hello', 'world', 'This', 'is', 'a', 'test', 'sentence']
NLTK Word Tokenization: ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', 'sentence', '.']


### Exercise
**Goal**:  
1. Write a function that splits text on white spaces **but** keeps punctuation as separate tokens.  
   - For example, `"Hello, world!"` → `["Hello", ",", "world", "!"]`.  
2. Compare it against `nltk.word_tokenize` on a set of example sentences.  
3. Discuss which approach might be more beneficial for sentiment analysis vs. language modeling.  

*Hint*: You can use regular expressions, capturing punctuation as separate tokens by including capturing groups for punctuation. 

In [40]:
import re

def whitespace_and_punct_tokenize(text):
    """
    Splits on whitespace, but also captures punctuation as separate tokens.
    For instance, "Hello, world!" -> ["Hello", ",", "world", "!"].
    """
    # Explanation of the regex:
    # (        start capture group
    #   [^\s\w]    matches any punctuation/symbol (non-whitespace, non-word)
    #   |          OR
    #   \w+        matches sequences of word characters
    # )        end capture group
    pattern = r"([^\s\w]|[\w]+)"
    return re.findall(pattern, text)

# Example usage
sample = "Hello, world! This is a test-sentence."
custom_tokens = whitespace_and_punct_tokenize(sample)
print("Custom Tokenization:", custom_tokens)

# Compare to NLTK
nltk_tokens = nltk.word_tokenize(sample)
print("NLTK Tokenization  :", nltk_tokens)


Custom Tokenization: ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '-', 'sentence', '.']
NLTK Tokenization  : ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test-sentence', '.']


### Observations

You can see how these two approaches differ in how they handle the hyphen in "test-sentence" or punctuation spacing.

**Which Approach for Which Task?**
* Sentiment Analysis often benefits from preserving punctuation (e.g., exclamation marks, question marks).
* Language Modeling might still use punctuation but sometimes merges punctuation with preceding words.

## 4.2. Character-Level Tokenization
<a id="character-level-tokenization"></a>

Character-level tokenization treats **every character as a token**. This approach ensures:
- **No OOV problems** (every symbol is known).
- Suitable for languages with complex morphology or no clear word boundaries (e.g., Chinese).

**Drawbacks**:
- Very long sequences (each sentence becomes a big list of characters).
- Models need more capacity to learn word-level or phrase-level meaning from individual characters.
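
Before a model can consume character tokens, each character needs an integer ID. The following is a minimal sketch of that mapping; the helper names (`build_char_vocab`, `encode_chars`, `decode_chars`) are hypothetical and introduced only for this example.

```python
def build_char_vocab(text):
    """Assign an integer ID to every distinct character, in sorted order."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}

def encode_chars(text, char_to_id):
    """Convert a string into a list of character IDs."""
    return [char_to_id[ch] for ch in text]

def decode_chars(ids, char_to_id):
    """Invert the mapping and rebuild the string from IDs."""
    id_to_char = {idx: ch for ch, idx in char_to_id.items()}
    return "".join(id_to_char[i] for i in ids)

corpus = "hello world"
char_to_id = build_char_vocab(corpus)
ids = encode_chars("low", char_to_id)
print(len(char_to_id), "characters in vocabulary")
# 8 characters in vocabulary
print(ids, "->", decode_chars(ids, char_to_id))
# [4, 5, 7] -> low
```

The word `low` never occurs in the corpus, yet it encodes without trouble because each of its characters does. The guarantee only covers characters seen when the vocabulary was built: in this sketch, a character such as `z` would raise a `KeyError`.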

### Example: Character-Level Tokenization

If you run the cell below, you should see something like:

`Character Tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']`

In [8]:
def character_tokenize(text):
    # Simply convert the text into a list of characters
    return list(text)

sample_sentence = "Hello world!"
char_tokens = character_tokenize(sample_sentence)
print("Character Tokens:", char_tokens)

Character Tokens: ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']


### Exercise
**Goal**:  
1. Implement a character-level tokenizer that **filters out** certain characters (e.g., remove digits, punctuation).  
2. Use it on a short paragraph and analyze how many tokens you get compared to word-level.  
3. (Optional) Experiment with training a simple language model (like an LSTM or GPT-like architecture) on a short text corpus using character-level tokens, and compare the performance with word-level tokens.  

In [41]:
def char_level_filter_tokenizer(text):
    """
    Returns a list of characters, filtering out digits and punctuation.
    Keeps only letters and whitespace for simplicity.
    """
    # You can refine or expand the set of kept characters as needed
    filtered_chars = []
    for ch in text:
        if ch.isalpha() or ch.isspace():
            filtered_chars.append(ch)
        # else skip punctuation, digits, etc.
    return filtered_chars

paragraph = "Hello Zafar! I have 2 apples, and 3 bananas."
chars = char_level_filter_tokenizer(paragraph)
print("Filtered Characters:", chars)
print("As joined string   :", ''.join(chars))

# Compare with a naive word-level:
word_tokens = paragraph.split()
print("Word-Level Tokens  :", word_tokens)

print(f"\nNumber of character tokens: {len(chars)}")
print(f"Number of word tokens:      {len(word_tokens)}")


Filtered Characters: ['H', 'e', 'l', 'l', 'o', ' ', 'Z', 'a', 'f', 'a', 'r', ' ', 'I', ' ', 'h', 'a', 'v', 'e', ' ', ' ', 'a', 'p', 'p', 'l', 'e', 's', ' ', 'a', 'n', 'd', ' ', ' ', 'b', 'a', 'n', 'a', 'n', 'a', 's']
As joined string   : Hello Zafar I have  apples and  bananas
Word-Level Tokens  : ['Hello', 'Zafar!', 'I', 'have', '2', 'apples,', 'and', '3', 'bananas.']

Number of character tokens: 39
Number of word tokens:      9


### Observations

* The character-level approach yields a larger token count because each letter is a token.
* You can adapt the filtering logic depending on your domain needs (e.g., keep digits or punctuation if they’re important).

## 4.3. Subword Tokenization
<a id="subword-tokenization"></a>

Subword tokenization methods like **Byte Pair Encoding (BPE)** or **WordPiece** aim to find a **balance** between word-level and character-level. 

**Motivation**:
- **OOV** words are less of a problem: If a new word appears, it can be broken down into smaller subword units it has already seen.  
- **Vocabulary size** can be managed because it merges characters into subwords until a desired vocab size is reached.

Example (Byte Pair Encoding):
1. Start with every character as a token.  
2. Count every adjacent pair of tokens and merge the most frequent pair into a single token.  
3. Repeat until reaching the desired vocabulary size.

**Key advantage**: It can handle morphological variations and new words effectively, without inflating the sequence length the way character-level tokenization does.

### Simple Subword Tokenizer (BPE) Example

Below is a **simplified** BPE-like implementation to illustrate the concept. The real versions (like in Hugging Face tokenizers) are more complex and efficient.

**Note**: Example below uses **characters** as the starting point. We merge the most frequent pairs step by step. Eventually, tokens like `"lo"` or `"low"` might appear if they are frequent pairs.

In [42]:
from collections import defaultdict

def get_stats(tokenized_corpus):
    """
    Counts how frequently pairs of tokens appear.
    :param tokenized_corpus: list of sentences, each sentence is a list of tokens.
    :return: dictionary {(token1, token2): count}
    """
    pairs = defaultdict(int)
    for sentence in tokenized_corpus:
        for i in range(len(sentence)-1):
            pairs[(sentence[i], sentence[i+1])] += 1
    return pairs

def merge_most_frequent(tokenized_corpus, pair_to_merge):
    """
    Merge the most frequent pair in the corpus.
    """
    new_tokenized_corpus = []
    (a, b) = pair_to_merge
    merge_token = a + b
    for sentence in tokenized_corpus:
        new_sentence = []
        skip = False
        for i in range(len(sentence)):
            if skip:
                skip = False
                continue
            if i < len(sentence) - 1 and sentence[i] == a and sentence[i+1] == b:
                new_sentence.append(merge_token)
                skip = True
            else:
                new_sentence.append(sentence[i])
        new_tokenized_corpus.append(new_sentence)
    return new_tokenized_corpus

def bpe_training(tokenized_corpus, num_merges=10):
    """
    Trains a simple BPE tokenizer.
    :param tokenized_corpus: list of tokenized sentences
    :param num_merges: how many merges to perform
    :return: The merged corpus after BPE
    """
    for i in range(num_merges):
        pairs = get_stats(tokenized_corpus)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        tokenized_corpus = merge_most_frequent(tokenized_corpus, best_pair)
        print(f"Step {i+1}, Merged {best_pair} -> {best_pair[0] + best_pair[1]}")
    return tokenized_corpus

# Example usage
corpus = [
    list("low "),
    list("lower "),
    list("newest "),
    list("wider ")
]

merged_corpus = bpe_training(corpus, num_merges=5)
print("Merged Corpus:", merged_corpus)

Step 1, Merged ('l', 'o') -> lo
Step 2, Merged ('lo', 'w') -> low
Step 3, Merged ('e', 'r') -> er
Step 4, Merged ('er', ' ') -> er 
Step 5, Merged ('low', ' ') -> low 
Merged Corpus: [['low '], ['low', 'er '], ['n', 'e', 'w', 'e', 's', 't', ' '], ['w', 'i', 'd', 'er ']]


### Exercise
**Goal**:  
1. Take a short text corpus, tokenize it by characters, and then apply the above `bpe_training` function with different values of `num_merges`.  
2. Observe how tokens evolve with each merge step.  
3. Discuss how subword merges make sense for repeated sequences of characters.  

*(Optional)*:  
- Modify the code to start with word-level tokens instead of characters, then sub-split words by merges.  
- Experiment to see if you can reduce the vocabulary size while still retaining most of the text’s representational power.

In [43]:
sample_corpus = [
    "hello hello", 
    "hello world", 
    "world wide web"
]

# Split each sentence into a list of characters
tokenized_corpus = [list(sent) for sent in sample_corpus]

# Suppose you do 5 merges
merged_corpus_5 = bpe_training(tokenized_corpus, num_merges=5)

print("\nMerged corpus with 5 merges:")
for line in merged_corpus_5:
    print(line)

# Suppose you do 2 merges
tokenized_corpus = [list(sent) for sent in sample_corpus]  # reset
merged_corpus_2 = bpe_training(tokenized_corpus, num_merges=2)

print("\nMerged corpus with 2 merges:")
for line in merged_corpus_2:
    print(line)


Step 1, Merged ('h', 'e') -> he
Step 2, Merged ('he', 'l') -> hel
Step 3, Merged ('hel', 'l') -> hell
Step 4, Merged ('hell', 'o') -> hello
Step 5, Merged (' ', 'w') ->  w

Merged corpus with 5 merges:
['hello', ' ', 'hello']
['hello', ' w', 'o', 'r', 'l', 'd']
['w', 'o', 'r', 'l', 'd', ' w', 'i', 'd', 'e', ' w', 'e', 'b']
Step 1, Merged ('h', 'e') -> he
Step 2, Merged ('he', 'l') -> hel

Merged corpus with 2 merges:
['hel', 'l', 'o', ' ', 'hel', 'l', 'o']
['hel', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
['w', 'o', 'r', 'l', 'd', ' ', 'w', 'i', 'd', 'e', ' ', 'w', 'e', 'b']


### Observations

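* Frequent pairs like ("h", "e") are merged first. When several pairs are tied, this implementation picks the one it encountered first, which is why `hello` is built up from left to right.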
* Compare the final tokens for num_merges=5 vs. num_merges=2. With fewer merges, you remain closer to pure character-level. With more merges, you start to see “subwords” forming.
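
The `bpe_training` function above prints its merges but does not return them, so it cannot tokenize text it has not seen. Encoding a new word means replaying the learned merges in order. The sketch below assumes the merges have been collected into a list; `apply_merges` and the hard-coded `merges` list are hypothetical and only illustrate the idea.

```python
def apply_merges(word, merges):
    """Split a word into characters, then replay each learned merge in order."""
    tokens = list(word)
    for a, b in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Fuse the pair (a, b) wherever it occurs side by side
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Merge list assumed for illustration: the kind of pairs a training run on
# words such as "low" and "lower" could produce.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(apply_merges("lowest", merges))
# ['low', 'e', 's', 't']
print(apply_merges("slower", merges))
# ['s', 'low', 'er']
```

Neither `lowest` nor `slower` needs to appear in the training corpus: both are broken into subwords that were already learned, plus single characters for the rest.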

# 5. Using Hugging Face’s Tokenizers Library
<a id="hf-tokenizers"></a>

While writing your own tokenizer is educational, for practical tasks, it’s more convenient (and often more optimized) to rely on well-tested libraries like **Hugging Face’s `tokenizers`**. This library offers efficient implementations of BPE, WordPiece, Unigram, and more.

**Note:** Make sure the `tokenizers` library is installed. You can install it using `pip install -U tokenizers`

### Example: Training a BPE Tokenizer with Hugging Face

**Key aspects**:
- We specify a **`vocab_size`** to control how many merges the tokenizer can learn.
- We include **special tokens** like `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]` for tasks like language modeling and classification.
- After training, we can **encode** new text to see how the tokenizer splits it into subword units.

In [12]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1. Initialize an empty BPE tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# 2. Prepare trainer
trainer = BpeTrainer(vocab_size=2000, show_progress=True, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Suppose we have a list of file paths or text data. For a small example, let's create one text file:
sample_text_data = """Hello world
Hello there
This is a small example corpus for BPE Tokenizer
"""

with open("sample_corpus.txt", "w") as f:
    f.write(sample_text_data)

# 3. Train the tokenizer
tokenizer.train(files=["sample_corpus.txt"], trainer=trainer)

# 4. Encode some text
encoded = tokenizer.encode("Hello world, this is a test!")
print("Encoded IDs:", encoded.ids)
print("Encoded Tokens:", encoded.tokens)




Encoded IDs: [36, 65, 0, 25, 15, 34, 34, 10, 25, 13, 24, 25, 0]
Encoded Tokens: ['Hello', 'world', '[UNK]', 't', 'h', 'is', 'is', 'a', 't', 'e', 's', 't', '[UNK]']


### Exercise
**Goal**:  
1. Use Hugging Face’s tokenizers library to train a tokenizer on a larger corpus (e.g., **Tiny Shakespeare** if you have it, or any small open-source dataset).  
2. Experiment with different `vocab_size` values (e.g., 500, 1000, 5000). Compare the resulting tokens for some sample sentences.  
3. Make a short summary of how the choice of `vocab_size` affects:
   - The granularity of your tokens.  
   - The presence of `[UNK]` tokens when encoding new text.  

In [44]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

text_corpus = """Hello world
Hello there
This is a small example corpus for BPE Tokenizer
We will compare vocab sizes
"""

# Save it to a file
with open("my_corpus.txt", "w") as f:
    f.write(text_corpus)

def train_and_test_tokenizer(vocab_size):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size, 
        show_progress=False, 
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
    )
    tokenizer.train(["my_corpus.txt"], trainer)
    return tokenizer

for vs in [200, 50]:
    tokenizer = train_and_test_tokenizer(vs)
    print(f"\n--- Vocab Size: {vs} ---")
    sample_text = "Hello world, this is a test!"
    encoding = tokenizer.encode(sample_text)
    print("Tokens:", encoding.tokens)
    print("Token IDs:", encoding.ids)



--- Vocab Size: 200 ---
Tokens: ['Hello', 'world', '[UNK]', 't', 'h', 'is', 'is', 'a', 't', 'es', 't', '[UNK]']
Token IDs: [41, 82, 0, 27, 17, 37, 37, 11, 27, 54, 27, 0]

--- Vocab Size: 50 ---
Tokens: ['Hello', 'w', 'or', 'l', 'd', '[UNK]', 't', 'h', 'is', 'is', 'a', 't', 'e', 's', 't', '[UNK]']
Token IDs: [41, 30, 34, 20, 14, 0, 27, 17, 37, 37, 11, 27, 15, 26, 27, 0]


### Observations

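* With a larger vocabulary (e.g., 200), more merges are learned, so you get bigger subwords: `world` stays whole and `es` appears as a unit.
* With a smaller vocabulary (e.g., 50), text is broken into more pieces, down to individual characters in some cases (`w`, `or`, `l`, `d`).
* The number of `[UNK]` tokens is the same in both runs: `,` and `!` never occur in the training corpus, so no vocabulary size can cover them here.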
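
One way to anticipate `[UNK]` tokens is to compare the characters of the new text with those of the training corpus. This is a rough sketch that assumes the tokenizer's base alphabet is simply the set of characters seen during training, which is consistent with the runs above but changes with options such as byte-level pre-tokenization. The `unseen_characters` helper is hypothetical.

```python
def unseen_characters(training_text, new_text):
    """Return the non-whitespace characters of new_text absent from training_text."""
    seen = set(training_text)
    return sorted({ch for ch in new_text if not ch.isspace() and ch not in seen})

training_text = (
    "Hello world\n"
    "Hello there\n"
    "This is a small example corpus for BPE Tokenizer\n"
)
print(unseen_characters(training_text, "Hello world, this is a test!"))
# ['!', ',']
```

The two characters reported here line up with the two `[UNK]` tokens in the encoded output, which points to character coverage rather than vocabulary size as their cause.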

# 6. Putting It All Together – Processing a Small Dataset
<a id="dataset-processing"></a>

Let’s walk through a **mini-pipeline** for **Tiny Shakespeare** or any small text dataset you have. The steps are:

1. **Load the dataset**: Suppose it’s a single text file called `tiny_shakespeare.txt`.
2. **Clean the text**: Remove unnecessary characters, unify case, etc.
3. **(Optionally) Normalize**: Decide if stemming/lemmatization is beneficial for your task.
4. **Tokenize**: 
   - Decide on your method (word-level, char-level, subword).
   - Train or load a pre-trained tokenizer if you’re doing subword.
5. **Save the processed data**: Convert tokens to integer IDs (if subword or word-level) and store them in a format that your model can read.

### Example Pipeline for Tiny Shakespeare

**Note:** In a real project, you’d store these token IDs in a serialized format (e.g., NumPy array, PyTorch tensor) and then use them for training a language model.
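
As a dependency-free illustration of that last step, the sketch below writes token IDs to a binary file with Python's standard `array` module and reads them back. The `save_ids` and `load_ids` helpers, the file name, and the ID values are all hypothetical; in practice you would more likely use NumPy or PyTorch as mentioned above.

```python
from array import array
import os
import tempfile

def save_ids(ids, path):
    """Write token IDs to a binary file as unsigned integers."""
    with open(path, "wb") as f:
        array("I", ids).tofile(f)

def load_ids(path):
    """Read all unsigned-integer IDs back from the binary file."""
    ids = array("I")
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    return ids.tolist()

token_ids = [63, 64, 68, 52, 61]  # made-up example IDs

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "token_ids.bin")  # hypothetical file name
    save_ids(token_ids, path)
    restored = load_ids(path)

print(restored)
# [63, 64, 68, 52, 61]
```

A flat binary layout like this keeps the file compact and lets a training script load the whole corpus as one long sequence of IDs.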

In [14]:
# Suppose you have 'tiny_shakespeare.txt' in your working directory
# with open("tiny_shakespeare.txt", "r") as f:
#     shakespeare_data = f.read()

shakespeare_data = """From fairest creatures we desire increase, 
That thereby beauty's rose might never die..."""  # Mock sample

# 1. Clean the data
clean_text = basic_data_cleaning(shakespeare_data)
# 2. (Optional) we won't do advanced normalization for a Shakespeare corpus, 
#    but we might use lemmatization if it suits the task.

# 3. Train a BPE tokenizer using Hugging Face’s tokenizers
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=500, show_progress=True, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# We'll save our cleaned text to a file for training the tokenizer
with open("shakespeare_cleaned.txt", "w") as f:
    f.write(clean_text)

tokenizer.train(files=["shakespeare_cleaned.txt"], trainer=trainer)

# 4. Encode text
encoded = tokenizer.encode(clean_text[:100])  # Just the first 100 chars
print("Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)

# 5. Save or convert processed data
# For instance, we can convert the entire text to a list of IDs
full_ids = tokenizer.encode(clean_text).ids
print("First 50 IDs of the corpus:", full_ids[:50])




Tokens: ['from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beautys', 'rose', 'might', 'never', 'die']
Token IDs: [63, 64, 68, 52, 61, 65, 57, 69, 70, 56, 66, 67, 62]
First 50 IDs of the corpus: [63, 64, 68, 52, 61, 65, 57, 69, 70, 56, 66, 67, 62]


### Exercise
**Goal**:
1. Download a small public-domain text (e.g., from [Project Gutenberg](https://www.gutenberg.org/) or [Tiny Shakespeare](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt)) and run through this pipeline end-to-end.  
2. Report how many tokens you get in total, and how many unique tokens there are.  
3. Experiment with adding or removing cleaning steps. For example, keep punctuation vs. remove punctuation. See how it changes your token distribution.  

In [46]:
!curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt



In [49]:
tiny_shakespeare = "input.txt"  # Download from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# Step 1: load
with open(tiny_shakespeare, "r", encoding="utf-8") as f:
    text_data = f.read()

# Step 2: clean
cleaned_text = basic_data_cleaning(text_data)  # from your earlier function

# Optional normalization step:
# tokens = word_tokenize(cleaned_text)
# lemmatized_tokens = [lemmatizer.lemmatize(t) for t in tokens]
# cleaned_text = " ".join(lemmatized_tokens)

# Step 3: Train tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, show_progress=True, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# We'll write out the cleaned text to a file
with open("cleaned_tiny_shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(cleaned_text)

tokenizer.train(["cleaned_tiny_shakespeare.txt"], trainer)

# Step 4 & 5: Encode, measure total tokens and unique tokens
encoded_output = tokenizer.encode(cleaned_text)
token_ids = encoded_output.ids
print("Total tokens in corpus:", len(token_ids))
print("Unique token IDs:", len(set(token_ids)))

# (Optional) Compare punctuation removal vs. punctuation kept:
# - Rerun the pipeline without removing punctuation in 'basic_data_cleaning'
# - Check how the vocabulary distribution changes





Total tokens in corpus: 297788
Unique token IDs: 972


### Observations

* Vocabulary size: a larger `vocab_size` allows more merges, which yields more unique tokens and shorter ID sequences.
* Removing punctuation might reduce the overall variety in tokens, but at the potential cost of losing punctuation signals.
* Comparing final tokenized outputs can show how different cleaning steps drastically alter the final ID sequences.
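
To see the effect of a cleaning choice on the token distribution without retraining a tokenizer, you can count tokens directly. The sketch below uses simple regex word tokens as a stand-in for the trained BPE tokenizer, so the numbers are only indicative. The `token_stats` function and the sample sentence are hypothetical.

```python
from collections import Counter
import re

def token_stats(text, keep_punctuation):
    """Return (total tokens, unique tokens, 3 most common) for lowercased text."""
    text = text.lower()
    if keep_punctuation:
        # Words, plus each punctuation mark as its own token
        tokens = re.findall(r"\w+|[^\w\s]", text)
    else:
        # Words only; punctuation is dropped
        tokens = re.findall(r"\w+", text)
    counts = Counter(tokens)
    return len(tokens), len(counts), counts.most_common(3)

sample = "To be, or not to be: that is the question."
total_kept, unique_kept, _ = token_stats(sample, keep_punctuation=True)
total_dropped, unique_dropped, _ = token_stats(sample, keep_punctuation=False)
print("With punctuation:   ", total_kept, "tokens,", unique_kept, "unique")
# With punctuation:    13 tokens, 11 unique
print("Without punctuation:", total_dropped, "tokens,", unique_dropped, "unique")
# Without punctuation: 10 tokens, 8 unique
```

Running the same comparison on the full corpus, with the trained tokenizer in place of the regex, gives the numbers the exercise asks for.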


# 7. Conclusion
<a id="conclusion"></a>

Congratulations on completing **Session 1**: Data Preprocessing and Tokenization! You have learned:

1. **Why cleaning and normalization** are essential to reduce noise and help models focus on relevant patterns.
2. **Word-level**, **character-level**, and **subword** tokenization strategies:
   - Their pros and cons.
   - Practical examples of how to implement or use them.
3. **Hands-on with Hugging Face tokenizers** to train a subword tokenizer on your own dataset.

**Next Steps**:
- In future sessions, we’ll look at building and training language models using these tokenized datasets.
- Keep experimenting with different tokenization and normalization techniques, as these can drastically affect model performance.
