Building a transformer model from scratch is sometimes the only option for specialized use cases. Although BERT and other transformer models have been pre-trained for many languages and domains, they do not cover everything.
Here we build the WordPiece tokenizer used by BERT from scratch.

BERT uses a WordPiece tokenizer, which splits each word into one or more subword tokens (word pieces). Because related words usually share some of the same word pieces, BERT can easily identify them. The resulting tokens are fed into the first layers of BERT.
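To make the idea of shared word pieces concrete, here is a minimal sketch of the greedy longest-match-first splitting commonly used to apply a WordPiece vocabulary at tokenization time. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made up for illustration; a real vocabulary is learned from data, as done later in this notebook.

```python
def wordpiece_tokenize(word, vocab, prefix="##", unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: treat the whole word as unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"parl", "##are", "##ando", "##ato", "[UNK]"}

parlare = wordpiece_tokenize("parlare", toy_vocab)
parlando = wordpiece_tokenize("parlando", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
print(parlare, parlando, unknown)
```

In this toy setup both "parlare" and "parlando" start with the piece `parl`, which is the kind of overlap that lets the model relate the two words.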

In [None]:
!pip install datasets 


In [None]:
import datasets

In [None]:
dataset=datasets.load_dataset('oscar','unshuffled_deduplicated_it',
    split='train[:2000000]')

OSError: ignored

# Format Data

In [None]:
import os
# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs('./oscar_it', exist_ok=True)

In [None]:
from tqdm.auto import tqdm

In [None]:
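text_data = []
file_count = 0

for sample in tqdm(dataset):
  # Replace newlines with spaces so each sample occupies exactly one line
  sample = sample['text'].replace('\n', ' ')
  text_data.append(sample)
  if len(text_data) == 5000:
    # Every 5,000 samples, write the buffer to its own file
    with open(f'./oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
      fp.write('\n'.join(text_data))
    text_data = []
    file_count += 1
# Write any leftover samples to a final file
if text_data:
  with open(f'./oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))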

NameError: ignored

In [None]:
from pathlib import Path
paths=[str(x) for x in Path('./oscar_it').glob('**/*.txt')]
paths[:5]

[]
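The chunking step in this section can also be checked without downloading OSCAR. The following is a small sketch under stated assumptions: `write_in_chunks` is a hypothetical helper name, the three sample strings are made up, and the files go to a temporary directory rather than `./oscar_it`.

```python
import tempfile
from pathlib import Path


def write_in_chunks(samples, out_dir, chunk_size=5000):
    """Write samples one per line, chunk_size samples per file; return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, chunk = [], []

    def flush():
        # Name files text_0.txt, text_1.txt, ... in the order they are written.
        path = out_dir / f"text_{len(paths)}.txt"
        path.write_text("\n".join(chunk), encoding="utf-8")
        paths.append(str(path))
        chunk.clear()

    for sample in samples:
        # Keep each sample on a single line.
        chunk.append(sample.replace("\n", " "))
        if len(chunk) == chunk_size:
            flush()
    if chunk:
        # Write whatever is left after the last full chunk.
        flush()
    return paths


# Made-up samples and a tiny chunk size, just to exercise the logic.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_in_chunks(["uno\ndue", "tre", "quattro"], tmp, chunk_size=2)
    demo_contents = [Path(p).read_text(encoding="utf-8") for p in demo_paths]
print(len(demo_paths), demo_contents)
```

With a chunk size of 2, the three samples end up in two files, and the embedded newline in the first sample is replaced by a space.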

# Initialize and train tokenizer

In [None]:
from tokenizers import BertWordPieceTokenizer

tokenizer= BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)

tokenizer.train(files=paths, vocab_size=30000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                special_tokens=[
                    '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])


In [None]:
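os.makedirs('./bert-it', exist_ok=True)
# Save without a filename prefix so the vocabulary is written as vocab.txt,
# the filename BertTokenizer.from_pretrained looks for inside a directory
tokenizer.save_model('./bert-it')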

# Load our tokenizer

In [None]:
from transformers import BertTokenizer

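# Match the training setting (lowercase=False); BertTokenizer lowercases by default
tokenizer = BertTokenizer.from_pretrained('./bert-it', do_lower_case=False)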

In [None]:
tokenizer('ciao! come va?')
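Calling the tokenizer on a sentence returns a dictionary-like object with `input_ids`, `token_type_ids` and `attention_mask`. As a rough illustration of what those fields contain for a single sentence, here is a toy sketch. The helper `toy_encode`, the pre-split tokens and all token IDs are hypothetical; the real IDs depend on the vocabulary learned during training.

```python
def toy_encode(tokens, vocab):
    """Mimic the shape of a BERT encoding for one already-tokenized sentence."""
    # BERT wraps each sentence in [CLS] ... [SEP].
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return {
        "input_ids": input_ids,
        # All zeros: every token belongs to the first (and only) sentence.
        "token_type_ids": [0] * len(input_ids),
        # All ones: no padding was added, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }


# Hypothetical vocabulary and IDs, for illustration only.
toy_ids = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
           "ciao": 5, "!": 6, "come": 7, "va": 8, "?": 9}

encoding = toy_encode(["ciao", "!", "come", "va", "?"], toy_ids)
print(encoding)
```

The first and last IDs correspond to `[CLS]` and `[SEP]`, and the three lists always have the same length.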