# char-tokenizer.ipynb

> Implementation of a character-level tokenizer.

## Attribution
The code in this notebook (`char-tokenizer.ipynb`) and the resulting module (`transformer_experiments.tokenizers.char_tokenizer`) is not mine. It comes from [Andrej Karpathy](https://karpathy.ai/)'s excellent video, [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY). I typed in the code by copying what I saw on the screen as I watched the video. Where something wasn't clear onscreen, I referred to the [GitHub repo for the video](https://github.com/karpathy/ng-video-lecture) and the [nanoGPT repo](https://github.com/karpathy/nanoGPT). After getting it working, I made only minor changes so that it fits the structure of this repository and works with the rest of its code. In summary: this module is Andrej Karpathy's work, not mine.

In [None]:
#| default_exp tokenizers.char_tokenizer

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
from fastcore.test import *

In [None]:
#| export
from typing import Callable, Dict, Iterable, Tuple

In [None]:
#| export
SToI = Dict[str, int]
IToS = Dict[int, str]
EncodeFn = Callable[[str], Iterable[int]]
DecodeFn = Callable[[Iterable[int]], str]


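def create_character_tokenizer(
    text: str,
) -> Tuple[Iterable[str], int, SToI, IToS, EncodeFn, DecodeFn]:
    """Create a character tokenizer from text.

    Returns the sorted vocabulary, its size, the char-to-id and id-to-char
    lookup tables, and the encode/decode functions built on them.
    """
    # The vocabulary is every distinct character in the text, in sorted order.
    chars = sorted(set(text))
    vocab_size = len(chars)
    # Each character's token id is its index in the sorted vocabulary.
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    # encode raises KeyError for any character that did not appear in `text`.
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: ''.join(itos[i] for i in ids)

    return chars, vocab_size, stoi, itos, encode, decode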

In [None]:
# Tests for create_character_tokenizer
chars, vocab_size, stoi, itos, encode, decode = create_character_tokenizer('abcabc')
test_eq(chars, ['a', 'b', 'c'])
test_eq(vocab_size, 3)
test_eq(stoi, {'a': 0, 'b': 1, 'c': 2})
test_eq(itos, {0: 'a', 1: 'b', 2: 'c'})
test_eq(encode('cab'), [2, 0, 1])
test_eq(decode([2, 1, 0]), 'cba')
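
The tests above only encode characters that are in the vocabulary. The sketch below shows two further behaviors that follow from the dictionary lookups: decoding an encoding returns the original string, and encoding a character that never appeared in the source text raises `KeyError`. To keep the snippet runnable on its own, it restates the tokenizer under the hypothetical name `build_tokenizer`; that name and the sample strings are assumptions for illustration, not part of the exported module.

```python
# Standalone restatement of the tokenizer so this snippet runs by itself.
# `build_tokenizer` is a hypothetical name used only for this illustration.
def build_tokenizer(text):
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: ''.join(itos[i] for i in ids)
    return chars, stoi, itos, encode, decode


chars, stoi, itos, encode, decode = build_tokenizer('hello world')

# Round trip: decoding an encoding gives back the original string.
ids = encode('hello')
round_trip = decode(ids)

# A character that never appeared in the source text has no entry in stoi,
# so the lookup inside encode raises KeyError for it.
try:
    encode('xyz')
    unseen_error = None
except KeyError as err:
    unseen_error = err.args[0]  # the first offending character
```

If unseen characters are expected at inference time, one option is to build the vocabulary from a text that covers them. Mapping them to a reserved "unknown" id would be a design change beyond what this module does.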


In [None]:
#| hide
import nbdev; nbdev.nbdev_export()