# Tokenizer

* Instead of treating every character as a token,
* we use a dedicated tokenizer.
* What the tokenizer is: the [BPE algorithm](https://en.wikipedia.org/wiki/Byte_pair_encoding).

## BPE tokenizer

Suppose we are given the following sequence of letters:

aaabdaaabac

Here the most frequent pair is `aa`. Let us replace this pair with `Z`:

ZabdZabac
Z=aa

Next, let us replace the next most frequent pair, `ab`, with `Y`:

ZYdZYac
Y=ab
Z=aa

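After these replacements the pair `ZY` still occurs twice, so it can in turn be replaced with `X`:

XdXac
X=ZY
Y=ab
Z=aa

Now no pair occurs more than once, so the sequence `aaabdaaabac` cannot be compressed any further.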
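
The merge procedure walked through above can be sketched in a few lines of plain Python. This is a toy illustration, not how the `tokenizers` library implements BPE. The names `merge_pair` and `toy_bpe` are made up for this sketch, merged symbols are written as concatenated strings rather than placeholder letters such as `Z`, and ties between equally frequent pairs are broken by first occurrence, so the order of merges may differ from the walkthrough.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, min_count=2):
    """Merge the most frequent adjacent pair until no pair occurs min_count times."""
    symbols = list(text)  # start from single characters
    merges = []
    while len(symbols) > 1:
        pair, count = Counter(zip(symbols, symbols[1:])).most_common(1)[0]
        if count < min_count:
            break
        symbols = merge_pair(symbols, pair)
        merges.append(pair)
    return symbols, merges


symbols, merges = toy_bpe("aaabdaaabac")
print(symbols)  # final symbols after all merges
print(merges)   # learned merge rules, in order
```

Because each merged symbol is just the concatenation of its two parts, joining the final symbols gives back the original string, which is the property a real BPE decoder relies on.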

In [1]:
from tokenizers import (Tokenizer, 
                        models, 
                        pre_tokenizers, 
                        decoders, 
                        trainers,
                        processors,
                        normalizers)
import datasets
import glob
import numpy as np



In [None]:
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
    add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./shaytonat 1-3-train.txt",
], trainer=trainer)

# And Save it
tokenizer.save("shaytonat-token.json", 
               pretty=True)

In [3]:
tokenizer = Tokenizer.from_file("./shaytonat-token.json")

print(tokenizer.get_vocab_size())

encoded = tokenizer.encode("У жойига ётди. Аввалига ёлғизликдан бир оз қўрқди. Сўнг ухлаб қолди. Бу сафар ошқозони таталаб уйғонди.")
len(encoded.ids)

20000


23

In [4]:
tokenizer.decode(encoded.ids)

' У жойига ётди. Аввалига ёлғизликдан бир оз қўрқди. Сўнг ухлаб қолди. Бу сафар ошқозони таталаб уйғонди.'
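
Two details of this round trip are worth spelling out. The leading space in the decoded string most likely comes from `add_prefix_space=True` in the pre-tokenizer settings. And because the tokenizer is byte-level, it starts from the UTF-8 bytes of the text rather than from its characters. The snippet below is a small standalone illustration in plain Python; it does not use the trained tokenizer, and `text` is just a fragment of the sample sentence.

```python
text = "Сўнг ухлаб қолди."
raw = text.encode("utf-8")  # the byte sequence a byte-level tokenizer starts from

print(len(text), "characters")
print(len(raw), "bytes")  # Cyrillic letters take more than one byte each

# Every byte is a value in 0..255, so 256 base symbols cover any input text.
assert all(0 <= b <= 255 for b in raw)

# Decoding the bytes restores the original string exactly.
assert raw.decode("utf-8") == text
```

This is the reason the trainer is seeded with `pre_tokenizers.ByteLevel.alphabet()`: with all 256 byte symbols in the vocabulary, no input can produce an unknown token.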

# Training the tokenizer on a corpus of laws

In [6]:
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

tokenizer.train(list(glob.glob('/home/tqqt1/AI/projects/gpts/training/data/lexuz/train/*.txt')), trainer=trainer)
tokenizer.save("lexuz-token.json", 
               pretty=True)






In [8]:
print('Number of training files: ', len(glob.glob('/home/tqqt1/AI/projects/gpts/training/data/lexuz/train/*.txt')))
print('Number of validation files: ', len(glob.glob('/home/tqqt1/AI/projects/gpts/training/data/lexuz/val/*.txt')))

Number of training files:  7774
Number of validation files:  864


In [16]:
class DataLoader:
    """Yields (input, target) batches of token ids for next-token prediction."""

    def __init__(self,
                 block_size,
                 batch_size,
                 data_dir,
                 tokenizer,
                 drop_last=True,
                 shuffle=True):

        self.block_size = block_size
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.shuffle = shuffle

        # Tokenize every file and join the ids into one long 1-D array
        tokens = []
        for file_path in sorted(glob.glob(f'{data_dir}/*.txt')):
            with open(file_path, 'r', encoding='utf-8') as f:
                ids = tokenizer.encode(f.read()).ids
            tokens.append(np.asarray(ids, dtype=np.int64))
        self.tokens = np.concatenate(tokens)

        # Number of non-overlapping blocks; one token is reserved so that
        # the last input position still has a target
        self.num_blocks = (self.tokens.shape[0] - 1) // self.block_size

    def __len__(self):
        full, remainder = divmod(self.num_blocks, self.batch_size)
        return full if self.drop_last or remainder == 0 else full + 1

    def __iter__(self):
        # Start offset of each block, optionally visited in random order
        starts = np.arange(self.num_blocks) * self.block_size
        if self.shuffle:
            np.random.shuffle(starts)

        for batch_idx in range(len(self)):
            batch_starts = starts[batch_idx * self.batch_size:
                                  (batch_idx + 1) * self.batch_size]
            # Targets are the inputs shifted one token to the right
            xb = np.stack([self.tokens[s:s + self.block_size] for s in batch_starts])
            yb = np.stack([self.tokens[s + 1:s + self.block_size + 1] for s in batch_starts])

            yield xb, yb

data_dir = '/home/tqqt1/AI/projects/gpts/training/data/lexuz/val'

val_loader = DataLoader(block_size=128,
                        batch_size=32,
                        data_dir=data_dir,
                        tokenizer=tokenizer)
xb, yb = next(iter(val_loader))
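
In the batches returned by the loader, `yb` holds the same tokens as `xb` shifted one position to the right, so the target at every position is simply the next token. A minimal pure-Python sketch of that relationship, using made-up token ids and a hypothetical helper `make_blocks` (not part of the notebook):

```python
def make_blocks(tokens, block_size):
    """Split tokens into non-overlapping (input, target) blocks for next-token prediction."""
    # Reserve one token so that the last input position still has a target.
    num_blocks = (len(tokens) - 1) // block_size
    blocks = []
    for b in range(num_blocks):
        start = b * block_size
        x = tokens[start:start + block_size]
        y = tokens[start + 1:start + block_size + 1]  # same window, shifted by one
        blocks.append((x, y))
    return blocks


tokens = list(range(10))  # stand-in token ids 0..9
blocks = make_blocks(tokens, block_size=4)
for x, y in blocks:
    print(x, "->", y)
```

With real data the rows would be token ids from the tokenizer, and `batch_size` such blocks would be stacked into one array per batch.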

# Training the tokenizer on a corpus of Uzbek-language books

In [5]:
dataset = datasets.load_dataset(
    "tahrirchi/uz-books", 
    split="lat")


In [6]:
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

In [None]:
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

# GPT

1. Pre-training -> Training - unsupervised  <-> BERT -> Cost

2. Fine-tuning: input text <-> output text -> Supervised -> LoRA -> low-rank adaptation 

In [None]:
64
32
16
8
4

In [6]:
W = 3000*10000
W

30000000

In [7]:
A = 3000*10
B = 10*10000
A+B

130000

In [None]:
A*B
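
The arithmetic in the cells above matches the parameter saving behind LoRA, assuming that 3000 and 10000 are the two dimensions of a weight matrix `W` and that 10 is the rank of the two factors `A` and `B` (the notes do not state this explicitly). Under the same assumption, the sketch below repeats the count for the values 64, 32, 16, 8, 4 listed earlier, read here as candidate ranks. `lora_param_counts` is a hypothetical helper name.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the two low-rank factors)."""
    full = d_out * d_in                      # W has shape (d_out, d_in)
    low_rank = d_out * rank + rank * d_in    # A is (d_out, rank), B is (rank, d_in)
    return full, low_rank


for rank in (64, 32, 16, 10, 8, 4):
    full, low_rank = lora_param_counts(3000, 10000, rank)
    print(f"rank={rank:>2}: {low_rank:>7} vs {full} parameters "
          f"({full / low_rank:.0f}x fewer)")
```

The saving comes from training only the two thin factors: their matrix product has the same shape as `W`, but its rank is at most the chosen `rank`.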

# Google Translate