## Train a German SentencePiece Tokenizer

Uses the dataset created by "make_german_ds.py" to train a German SentencePiece tokenizer.

In [None]:
from datasets import load_from_disk

ds = load_from_disk("german_ds")
# Copy the column into a plain list so the train and test texts can be merged.
corpus = list(ds["train"]["text"])
corpus.extend(ds["test"]["text"])

The target vocabulary size matches the base vocabulary size of T5 (32,000). The 100 extra IDs (sentinel tokens) are added automatically when the tokenizer is created.

In [None]:
target_vocab_size = 32000
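
The arithmetic behind the final vocabulary size can be sketched as follows. This assumes the T5 convention of 100 sentinel tokens named `<extra_id_0>` through `<extra_id_99>`; the names `num_extra_ids`, `sentinel_tokens`, and `final_vocab_size` are illustrative and not part of the notebook.

```python
# Sketch: the final tokenizer vocabulary is the SentencePiece vocabulary
# plus the sentinel tokens that T5-style tokenizers append.
target_vocab_size = 32000
num_extra_ids = 100  # assumption: the T5 default number of sentinel tokens

sentinel_tokens = [f"<extra_id_{i}>" for i in range(num_extra_ids)]
final_vocab_size = target_vocab_size + len(sentinel_tokens)

print(final_vocab_size)
print(sentinel_tokens[0], sentinel_tokens[-1])
```

Under these assumptions, a model that consumes this tokenizer would need an embedding table sized for the combined vocabulary rather than for `target_vocab_size` alone.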

Train the SentencePiece model. This will take a while.

In [None]:
import io

import sentencepiece as spm

# Train into an in-memory buffer instead of writing files via model_prefix.
spm_model = io.BytesIO()
spm.SentencePieceTrainer.Train(
    sentence_iterator=iter(corpus),  # yields one training text per item
    model_writer=spm_model,
    # model_prefix='spmodel',
    vocab_size=target_vocab_size,
    pad_id=0,
    unk_id=1,
    eos_id=2,
    bos_id=3,
    pad_piece='<pad>',
    unk_piece='<unk>',
    eos_piece='</s>',
    bos_piece='<cls>',
    # model_type='unigram'
)
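
The `sentence_iterator` argument consumes an iterator that yields one string per training sentence. As a minimal sketch, a generator like the hypothetical `iter_sentences` below could skip empty entries and strip surrounding whitespace before training. The helper and its sample data are assumptions for illustration, not part of the original notebook.

```python
from typing import Iterable, Iterator


def iter_sentences(texts: Iterable[str]) -> Iterator[str]:
    """Yield one cleaned, non-empty string per input text (hypothetical helper)."""
    for text in texts:
        text = text.strip()
        if text:  # skip empty or whitespace-only entries
            yield text


# Illustrative sample data, not taken from the real dataset.
sample_corpus = ["Guten Morgen!", "   ", "Wie geht es dir?\n"]
cleaned = list(iter_sentences(sample_corpus))
print(cleaned)

# Pitfall: iterating over a single string yields characters, not sentences.
chars = list(iter_sentences("Hallo"))
print(chars)
```

Such a generator could then be passed as `sentence_iterator=iter_sentences(corpus)`.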

Save the SentencePiece model to a new folder.

In [None]:
import os

model_dir = "spiece_model"  # named model_dir to avoid shadowing the built-in dir()
model_name = "spiece.model"
os.makedirs(model_dir, exist_ok=True)
spm_filepath = os.path.join(model_dir, model_name)
# Write the in-memory model bytes to disk.
with open(spm_filepath, "wb") as f:
    f.write(spm_model.getvalue())

Use MT5TokenizerFast to create a new tokenizer from the new SentencePiece model.

In [None]:
from transformers import MT5TokenizerFast

german_tokenizer = MT5TokenizerFast(spm_filepath)

A quick test: tokenize a German sentence and inspect the resulting tokens.

In [None]:
text = "Diese Antikörper bleiben auch nach der überstandenen Krankheit für einige Zeit bestehen und können vor einer erneuten Erkrankung schützen."

german_tokenizer.convert_ids_to_tokens(german_tokenizer(text)['input_ids'])
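
In the printed tokens, SentencePiece marks a preceding space with the character `▁` (U+2581). As a rough sketch, the original text can be recovered from a list of pieces by concatenating them and turning that marker back into spaces. The helper `pieces_to_text` is hypothetical, and the pieces below are made up for illustration; the actual segmentation depends on the trained model.

```python
WORD_BOUNDARY = "\u2581"  # "▁", SentencePiece's whitespace marker


def pieces_to_text(pieces, special_tokens=("</s>", "<pad>")):
    """Join subword pieces back into plain text (hypothetical helper)."""
    # Drop special tokens such as the end-of-sequence marker.
    kept = [p for p in pieces if p not in special_tokens]
    # Concatenate the pieces and turn boundary markers back into spaces.
    return "".join(kept).replace(WORD_BOUNDARY, " ").strip()


# Made-up segmentation for illustration only.
example_pieces = ["▁Diese", "▁Anti", "körper", "▁bleiben", "</s>"]
restored = pieces_to_text(example_pieces)
print(restored)
```

Pieces without the marker, such as `körper` in this made-up example, continue the previous word, which is how a rare word can be split into several subword units.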

Save the new tokenizer.

In [None]:
tokenizer_path = "german_tokenizer"
german_tokenizer.save_pretrained(tokenizer_path)