In [2]:
# !pip install datasets

### Downloading the Hindi dataset with the Hugging Face datasets library

This section obtains a large Hindi text corpus using the Hugging Face datasets library. We load the Wikipedia dataset with the '20231101.hi' configuration, a fixed snapshot of Hindi Wikipedia. Pinning the snapshot keeps the corpus reproducible, and the articles give a varied text source for training a custom tokenizer.



In [4]:
# Load the Hindi Wikipedia corpus
from tqdm import tqdm
from datasets import load_dataset

# '20231101.hi' selects the Hindi edition of the 2023-11-01 snapshot
dataset = load_dataset('wikimedia/wikipedia', '20231101.hi')

README.md: 0.00B [00:00, ?B/s]

20231101.hi/train-00000-of-00002.parquet:   0%|          | 0.00/135M [00:00<?, ?B/s]

20231101.hi/train-00001-of-00002.parquet:   0%|          | 0.00/103M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/163093 [00:00<?, ? examples/s]

In [5]:
print(dataset)
print(dataset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 163093
    })
})
{'id': '10', 'url': 'https://hi.wikipedia.org/wiki/%E0%A4%B9%E0%A4%AE%20%E0%A4%B9%E0%A5%8B%E0%A4%82%E0%A4%97%E0%A5%87%20%E0%A4%95%E0%A4%BE%E0%A4%AE%E0%A4%AF%E0%A4%BE%E0%A4%AC', 'title': 'हम होंगे कामयाब', 'text': 'हम होंगे कामयाब ( का गिरिजा कुमार माथुर द्वारा किया गया हिंदी भावानुवाद) एक प्रतिरोध गीत है। यह गीत बीसवीं सदी में नागरिक अधिकार आंदोलन का प्रधान स्वर बना। इस गीत को आमतौर पर "I\'ll Overcome Some Day" ("आई विल ओवरकम सम डे") से काव्यावतरित माना जाता है, जो चार्ल्स अल्बर्ट टिंडले द्वारा गाया गया था और जिसे 1900 में पहली बार प्रकाशित किया गया था।\n\nसन्दर्भ\nHum Honge Kamyab Lyrics \nनागरिक अधिकार आंदोलन\nदेशभक्ति के गीत\nआधार'}


In [8]:
# Select the first 1,000 documents (slice the rows first, then take the text column)
texts = dataset['train'][:1000]['text']
combined_text = '\n'.join(texts)
print(f"Total characters: {len(combined_text)}")

Total characters: 7485244


### Defining tokenizer training functions

This section implements a simplified Byte Pair Encoding (BPE) algorithm for training a custom tokenizer. Each function is described below:

**get_stats(ids, counts=None)** :

**Purpose**: Counts how often each pair of consecutive IDs (bytes or previously merged tokens) occurs in the sequence ids.

**How it works**: It walks through ids two adjacent elements at a time, (ids[i], ids[i+1]), and increments that pair's count in a dictionary. If an existing counts dictionary is passed in, it is updated rather than replaced.

**Role in BPE**: BPE repeatedly merges the most frequent adjacent pair into a single new token. This function supplies the counts from which that pair is chosen.
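
As a small illustration of the pair counting described above, the sketch below counts adjacent pairs in a toy ID list and picks the most frequent one. The name `count_pairs` and the sample IDs are hypothetical; the function is a stand-in that mirrors the described behaviour, not the notebook's own cell.

```python
def count_pairs(ids):
    """Count how often each adjacent pair occurs (illustrative stand-in for get_stats)."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


toy_ids = [1, 2, 3, 1, 2]
pair_counts = count_pairs(toy_ids)

# The pair with the highest count is the next merge candidate
most_frequent = max(pair_counts, key=pair_counts.get)

print(pair_counts)    # {(1, 2): 2, (2, 3): 1, (3, 1): 1}
print(most_frequent)  # (1, 2)
```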


**merge(ids, pair, idx)**:

**Purpose**: Replaces every occurrence of a specific pair in ids with a new token ID, idx.

**How it works**: It scans ids from left to right. When ids[i] == pair[0] and ids[i+1] == pair[1], it appends idx to the output and advances by two positions; otherwise it appends ids[i] and advances by one. The input list is not modified; a new list is returned.

**Role in BPE**: Once get_stats has identified the most frequent pair, merge applies the merge, shortening the sequence and introducing a new token into the vocabulary.
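
To make the left-to-right replacement concrete, here is a minimal sketch on toy data. The name `merge_pair` and the ID 257 are assumptions chosen for illustration. The second call shows that overlapping matches are consumed greedily from the left, so three equal IDs yield one merged token plus one leftover.

```python
def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # both elements of the pair are consumed
        else:
            out.append(ids[i])
            i += 1
    return out


merged = merge_pair([1, 2, 3, 1, 2], (1, 2), 257)
overlapping = merge_pair([1, 1, 1], (1, 1), 257)

print(merged)       # [257, 3, 257]
print(overlapping)  # [257, 1]
```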


**train(text, target_ratio=3, target_vocab_size=5000)**:

**Purpose**: This is the main function. It runs the BPE training loop and returns the merge rules and the tokenizer vocabulary.

**How it works**:

**Preprocessing**: It prepends an underscore to the text, replaces every space with an underscore, and encodes the result as UTF-8. The resulting bytes are the initial tokens.

**Iterative merging**: It loops until the vocabulary size reaches target_vocab_size. Each iteration calls get_stats to find the most frequent pair in the current token sequence, assigns that pair a new token ID (idx = 256 + i, with i starting at 1, so the first merged token is 257 and ID 256 is never assigned), records the rule pair -> idx in the merges dictionary, and calls merge to replace every occurrence of the pair with idx. The target_ratio parameter is accepted, and the compression ratio is computed on each iteration, but neither is used as a stopping condition.

**Vocabulary construction**: After merging, it builds the vocab dictionary, which maps every token ID (the 256 byte IDs and each merged token ID) to its byte sequence.

**Role in BPE**: This function carries out the whole algorithm: by repeatedly merging the most frequent pair, it builds a vocabulary of common subword units.
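
The training loop can be traced by hand on a very short string. The sketch below is a simplified, self-contained variant under stated assumptions: `toy_train` and its parameters are hypothetical names, it stops after a fixed number of merges instead of a target vocabulary size, and it skips the underscore preprocessing. New IDs start at 257 to mirror the train cell, which computes idx = 256 + i with i starting at 1.

```python
def toy_train(text, num_merges, first_new_id=257):
    """Run a fixed number of BPE merges on the UTF-8 bytes of text (illustrative only)."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        # Count adjacent pairs in the current sequence
        counts = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        new_id = first_new_id + step
        merges[best] = new_id
        # Replace every occurrence of the best pair with the new ID
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    # Each merged token's bytes are the concatenation of its two parts
    vocab = {b: bytes([b]) for b in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return tokens, merges, vocab


toy_tokens, toy_merges, toy_vocab = toy_train("aaabdaaabac", num_merges=3)

print(toy_merges)      # {(97, 97): 257, (257, 97): 258, (258, 98): 259}
print(toy_tokens)      # [259, 100, 259, 97, 99]
print(toy_vocab[259])  # b'aaab'
```

Eleven input bytes shrink to five tokens after three merges, and the last merged token stands for the four-byte sequence "aaab".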



In [9]:
def get_stats(ids, counts=None):
    """
    Count occurrences of consecutive byte pairs.

    Args:
        ids: List of token IDs
        counts: Optional existing counts dictionary to update

    Returns:
        Dictionary mapping pairs to their occurrence counts
    """
    counts = {} if counts is None else counts

    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1

    return counts

In [10]:
def merge(ids, pair, idx):
    """
    Replace all occurrences of a pair with a new token ID.

    Args:
        ids: List of token IDs
        pair: Tuple of (token1, token2) to merge
        idx: New token ID to replace the pair with

    Returns:
        New list with pairs replaced
    """
    new_ids = []
    i = 0

    while i < len(ids):
        # Check if current position matches the pair
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            new_ids.append(idx)
            i += 2  # Skip both elements of the pair
        else:
            new_ids.append(ids[i])
            i += 1

    return new_ids

In [11]:
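def train(text, target_ratio=3, target_vocab_size=5000):
    """
    Train a byte-level BPE tokenizer.

    Args:
        text: Training text
        target_ratio: Accepted but not used as a stopping condition
        target_vocab_size: Stop once the vocabulary reaches this size

    Returns:
        Tuple of (merges, vocab)
    """
    # Mark word boundaries with underscores, then start from raw UTF-8 bytes
    text = '_' + text.replace(' ', '_')
    tokens = list(text.encode("utf-8"))
    original_length = len(tokens)
    merges = {}
    i = 1

    while True:
        stats = get_stats(tokens)
        # Most frequent adjacent pair; ties go to the pair seen first
        pair = max(stats, key=stats.get)
        # i starts at 1, so new IDs begin at 257 and ID 256 is never assigned
        idx = 256 + i
        merges[pair] = idx
        tokens = merge(tokens, pair, idx)

        current_ratio = original_length / len(tokens)
        current_vocab = 256 + i

        # print(f"pair created: {pair} -> {idx} ({stats[pair]} occurrences), ratio: {current_ratio:.2f}, vocab: {current_vocab}")

        if current_vocab >= target_vocab_size:
            break

        i += 1

    # Base vocabulary: one entry per byte value
    vocab = {idx: bytes([idx]) for idx in range(256)}
    # Merged tokens: concatenate the bytes of the two parts, in merge order
    for pair, idx in merges.items():
        vocab[idx] = vocab[pair[0]] + vocab[pair[1]]

    return merges, vocab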

### Adding encoding and decoding functions

**encode(text)** function:

**Purpose**: Converts raw text into a list of integer token IDs.

**How it works**:

**Preprocessing**: As in train, it prepends an underscore, replaces spaces with underscores, and encodes the text as UTF-8 bytes.

**Applying merges**: It applies every rule in the merges dictionary in the order the rules were learned, using merge to replace each pair with its token ID. Each pass can only shorten the sequence, so the bytes are progressively combined into longer tokens.

**What it returns**: A list of integers, each the ID of a byte or of a merged subword token.


**decode(ids)** function:

**Purpose**: Reverses encode: it takes a list of token IDs and reconstructs readable text.

**How it works**:

**Lookup in vocab**: For each ID in the input list, it looks up the corresponding byte sequence in the vocab dictionary, which maps every token ID (byte or merged token) to its bytes.

**Concatenation**: It joins these byte sequences into a single bytes object.

**Decoding and postprocessing**: It decodes the bytes as UTF-8, replaces underscores with spaces, and strips leading and trailing whitespace. Because of these last two steps, text that itself contains underscores or leading/trailing whitespace is not restored exactly.

**What it returns**: The reconstructed text as a str.
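
The underscore convention can be checked in isolation, without any trained merges. The sketch below reproduces only the pre- and postprocessing steps; `to_marked` and `from_marked` are hypothetical helper names. It suggests that ordinary text round-trips, while text that already contains an underscore does not.

```python
def to_marked(text):
    """Preprocessing step: prepend a boundary marker and replace spaces with underscores."""
    return "_" + text.replace(" ", "_")


def from_marked(marked):
    """Postprocessing step: turn underscores back into spaces and strip the edges."""
    return marked.replace("_", " ").strip()


clean = "भारत में"
with_underscore = "snake_case name"

clean_round_trip = from_marked(to_marked(clean))
lossy_round_trip = from_marked(to_marked(with_underscore))

print(clean_round_trip == clean)  # True
print(lossy_round_trip)           # snake case name
```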

In [16]:
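def encode(text):
    # Same preprocessing as train: mark word boundaries, then take UTF-8 bytes
    text = '_' + text.replace(' ', '_')
    tokens = list(text.encode("utf-8"))
    if len(tokens) < 2:
        return tokens
    # Apply the learned merges in training order (dicts preserve insertion order)
    for pair, idx in merges.items():
        tokens = merge(tokens, pair, idx)
    return tokens

def decode(ids):
    # Join each token's bytes, then decode the whole sequence as UTF-8
    raw_bytes = b"".join(vocab[token_id] for token_id in ids)
    text = raw_bytes.decode("utf-8")
    # Undo the underscore substitution and drop the leading boundary marker
    text = text.replace('_', ' ').strip()
    return text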

### Training and testing: checking the compression ratio obtained

Training the tokenizer on a subset of data:

**Purpose**: To keep training time manageable, the next cell trains on smaller_text, the first 100,000 characters of combined_text.

**Process**: It calls train with smaller_text and target_vocab_size=6000, so BPE keeps merging until the vocabulary holds 6000 tokens (the 256 byte tokens plus the merged tokens).

**Output**: The merges dictionary (the learned merge rules) and the vocab dictionary (token IDs mapped to their byte sequences).


In [14]:
# Use only the first 100K characters
smaller_text = combined_text[:100000]
merges, vocab = train(smaller_text, target_vocab_size=6000)

In [15]:
# Look at the first 30 merges with decoded patterns
for i, (pair, idx) in enumerate(list(merges.items())[:30]):
    decoded_pattern = vocab[idx].decode('utf-8', errors='ignore')
    print(f"Token {idx}: {pair} -> '{decoded_pattern}'")

Token 257: (224, 164) -> ''
Token 258: (224, 165) -> ''
Token 259: (95, 257) -> '_'
Token 260: (257, 190) -> 'ा'
Token 261: (258, 141) -> '्'
Token 262: (261, 257) -> '्'
Token 263: (260, 257) -> 'ा'
Token 264: (258, 135) -> 'े'
Token 265: (257, 191) -> 'ि'
Token 266: (258, 128) -> 'ी'
Token 267: (265, 257) -> 'ि'
Token 268: (257, 176) -> 'र'
Token 269: (259, 149) -> '_क'
Token 270: (257, 130) -> 'ं'
Token 271: (264, 259) -> 'े_'
Token 272: (266, 259) -> 'ी_'
Token 273: (257, 168) -> 'न'
Token 274: (257, 164) -> 'त'
Token 275: (258, 139) -> 'ो'
Token 276: (260, 259) -> 'ा_'
Token 277: (257, 149) -> 'क'
Token 278: (270, 259) -> 'ं_'
Token 279: (262, 176) -> '्र'
Token 280: (257, 184) -> 'स'
Token 281: (258, 136) -> 'ै'
Token 282: (263, 176) -> 'ार'
Token 283: (258, 129) -> 'ु'
Token 284: (95, 95) -> '__'
Token 285: (259, 184) -> '_स'
Token 286: (257, 174) -> 'म'
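
Tokens 257 and 258 print as empty strings because their byte pairs, (224, 164) and (224, 165), are only the first two bytes of three-byte UTF-8 sequences, and the listing decodes with errors='ignore'. The sketch below illustrates this with the byte values taken from the output above; the variable names are chosen for illustration.

```python
prefix = bytes([224, 164])     # the pair merged into token 257
full = bytes([224, 164, 190])  # token 260 is (257, 190), a complete character

ignored = prefix.decode("utf-8", errors="ignore")    # incomplete bytes are dropped
replaced = prefix.decode("utf-8", errors="replace")  # incomplete bytes become U+FFFD
complete = full.decode("utf-8")

print(repr(ignored))           # ''
print("\ufffd" in replaced)    # True
print(complete == "\u093e")    # True (the vowel sign AA shown for token 260)
```

Decoding with errors='replace' instead would make such partial tokens visible in the listing rather than silently empty.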


In [17]:
# Test 1: Encoding and decoding
test_text = "आर्टिफिशियल इंटेलिजेंस भारत में"
encoded = encode(test_text)
decoded = decode(encoded)
print("Test 1: Encode/Decode")
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Match: {decoded == test_text}\n")

Test 1: Encode/Decode
Original: आर्टिफिशियल इंटेलिजेंस भारत में
Encoded: [396, 1346, 267, 171, 492, 309, 297, 4792, 811, 824, 596, 280, 398, 472]
Decoded: आर्टिफिशियल इंटेलिजेंस भारत में
Match: True



In [18]:
# Test 2: Compression ratio
original_bytes = len(test_text.encode('utf-8'))
print("Test 2: Compression")
print(f"Original bytes: {original_bytes}")
print(f"Encoded tokens: {len(encoded)}")
print(f"Compression ratio: {original_bytes / len(encoded):.2f}\n")

Test 2: Compression
Original bytes: 87
Encoded tokens: 14
Compression ratio: 6.21
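
The ratio above is measured in bytes per token. Each Devanagari code point takes three bytes in UTF-8, so the same result looks smaller when expressed in characters per token. The sketch below recomputes both figures for the test sentence; the token count of 14 is copied from the output above rather than recomputed, and the variable names are illustrative.

```python
sample_text = "आर्टिफिशियल इंटेलिजेंस भारत में"
num_tokens = 14  # taken from the "Encoded tokens" line in the notebook output

num_bytes = len(sample_text.encode("utf-8"))
num_chars = len(sample_text)  # Unicode code points, not visual glyphs

print(f"Bytes per token: {num_bytes / num_tokens:.2f}")       # Bytes per token: 6.21
print(f"Characters per token: {num_chars / num_tokens:.2f}")  # Characters per token: 2.21
```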



In [19]:
# Test 3: Vocab size
print("Test 3: Vocabulary")
print(f"Vocab size: {len(vocab)}")
print(f"Number of merges: {len(merges)}")

Test 3: Vocabulary
Vocab size: 6000
Number of merges: 5744


### Saving merges and vocab for tokenizer use

In [20]:
import base64
import json

# Convert vocab: bytes values to base64 strings
vocab_serializable = {
    idx: base64.b64encode(byte_seq).decode('utf-8')
    for idx, byte_seq in vocab.items()
}

# Convert merges: tuple keys to strings
merges_serializable = {
    f"{k[0]},{k[1]}": v
    for k, v in merges.items()
}

# Save to JSON files
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab_serializable, f, ensure_ascii=False, indent=2)

with open('merges.json', 'w', encoding='utf-8') as f:
    json.dump(merges_serializable, f, ensure_ascii=False, indent=2)

print("Tokenizer saved to vocab.json and merges.json")

Tokenizer saved to vocab.json and merges.json
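
Loading the files back requires reversing both conversions: JSON turns the integer vocab keys into strings, and the merge keys were written as "a,b" strings. The sketch below is one possible loader, assuming the file format written by the cell above; `load_tokenizer` is a hypothetical helper, and the demo uses a tiny made-up vocabulary in a temporary directory instead of the real files.

```python
import base64
import json
import os
import tempfile


def load_tokenizer(vocab_path, merges_path):
    """Read vocab and merges back into their in-memory forms (assumed file format)."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {int(k): base64.b64decode(v) for k, v in json.load(f).items()}
    with open(merges_path, encoding="utf-8") as f:
        # json.load keeps the file's key order, so merge order is preserved
        merges = {
            tuple(int(part) for part in key.split(",")): idx
            for key, idx in json.load(f).items()
        }
    return merges, vocab


# Demo with a tiny made-up tokenizer, saved in the same format as above
demo_vocab = {97: b"a", 98: b"b", 257: b"ab"}
demo_merges = {(97, 98): 257}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.json")
    merges_path = os.path.join(tmp, "merges.json")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump({i: base64.b64encode(b).decode("utf-8") for i, b in demo_vocab.items()}, f)
    with open(merges_path, "w", encoding="utf-8") as f:
        json.dump({f"{a},{b}": idx for (a, b), idx in demo_merges.items()}, f)
    loaded_merges, loaded_vocab = load_tokenizer(vocab_path, merges_path)

print(loaded_merges == demo_merges)  # True
print(loaded_vocab == demo_vocab)    # True
```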
