<a href="https://colab.research.google.com/github/rdkdaniel/The-Swahili-Project/blob/main/Swahili_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The General Model**

Here I develop a Kiswahili language model. This model will then be fine-tuned for downstream tasks. The objectives for this model are:


1.   Get the data: we shall use the established Swahili dataset (Source:).
2.   Design the tokenizer (one already exists, so we will load it).
3.   Create an input pipeline.
4.   Train the model.


# **A Full Training Guide**

## **Libraries**

In [1]:
!pip install datasets


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected 

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.24.0


In [3]:
import transformers

In [4]:
import pandas as pd

## **1.0 Getting the Dataset**

In [None]:
# Note: a dataset hosted on Hugging Face could be used here instead
df = pd.read_csv('/content/drive/MyDrive/train.csv')
print(df)

            id                                            content category
0       SW4670   Bodi ya Utalii Tanzania (TTB) imesema, itafan...   uchumi
1      SW30826   PENDO FUNDISHA-MBEYA RAIS Dk. John Magufuri, ...  kitaifa
2      SW29725  Mwandishi Wetu -Singida BENKI ya NMB imetoa ms...   uchumi
3      SW20901   TIMU ya taifa ya Tanzania, Serengeti Boys jan...  michezo
4      SW12560   Na AGATHA CHARLES – DAR ES SALAAM ALIYEKUWA K...  kitaifa
...        ...                                                ...      ...
23263  SW24920   Alitoa pongezi hizo alipozindua rasmi hatua y...   uchumi
23264   SW4038   Na NORA DAMIAN-DAR ES SALAAM  TEKLA (si jina ...  kitaifa
23265  SW16649   Mkuu wa Mkoa wa Njombe, Dk Rehema Nchimbi wak...   uchumi
23266  SW23291   MABINGWA wa Ligi Kuu Soka Tanzania Bara, Simb...  michezo
23267  SW11778   WIKI iliyopita, nilianza makala haya yanayole...  kitaifa

[23268 rows x 3 columns]


## **2.0 The Tokenizer**

In [5]:
from transformers import PreTrainedTokenizerFast

In [6]:
# load the tokenizer that was trained earlier and saved to file
tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

*We can attempt to encode some text with it*

In [7]:
# test our tokenizer on a simple sentence
tokens = tokenizer('Jumbo, habari yako?')

In [8]:
print(tokens)

{'input_ids': [2, 741, 387, 12, 1265, 808, 29, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}


In [9]:
tokens.input_ids

[2, 741, 387, 12, 1265, 808, 29, 3]

## **3.0 Input Pipeline**

### **3.1 Small Data Prep**

## **4.0 Training the Model**

# **A Small Test**

In [10]:
from transformers import PreTrainedTokenizerFast

In [12]:
checkpoint = '/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer'
tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)

In [13]:
from datasets import load_dataset

raw_datasets = load_dataset("swahili")
raw_datasets

Downloading builder script:   0%|          | 0.00/3.82k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.67k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading and preparing dataset swahili/swahili to /root/.cache/huggingface/datasets/swahili/swahili/1.0.0/15bf1d99abb939f83b5da3c798ed55e9803b3ea430f06bf7e003bd073b60172a...


Downloading data:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/42069 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3371 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3372 [00:00<?, ? examples/s]

Dataset swahili downloaded and prepared to /root/.cache/huggingface/datasets/swahili/swahili/1.0.0/15bf1d99abb939f83b5da3c798ed55e9803b3ea430f06bf7e003bd073b60172a. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 42069
    })
    test: Dataset({
        features: ['text'],
        num_rows: 3371
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3372
    })
})

In [14]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'text': 'taarifa hiyo ilisema kuwa ongezeko la joto la maji juu ya wastani katikati ya bahari ya UNK inaashiria kuwepo kwa mvua za el nino UNK hadi mwishoni mwa april ishirini moja sifuri imeelezwa kuwa ongezeko la joto magharibi mwa bahari ya hindi linatarajiwa kuhamia katikati ya bahari hiyo hali ambayo itasababisha pepo kutoka kaskazini mashariki kuvuma kuelekea bahari ya hindi'}

In [15]:
raw_train_dataset.features

{'text': Value(dtype='string', id=None)}

In [16]:
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

tokenized_sentences = tokenizer(raw_datasets["train"]["text"])


In [17]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["text"],
    padding=True,
    truncation=True,
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
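The warning above appears because `truncation=True` has no target length: no `max_length` was passed and the loaded tokenizer does not define one. The following is a plain-Python sketch of what a length cap does, not the library's implementation. The helper `truncate_batch` is hypothetical, the cap of 512 is borrowed from later in this notebook, and special tokens are ignored for simplicity.

```python
from typing import List, Optional


def truncate_batch(batch: List[List[int]], max_length: Optional[int]) -> List[List[int]]:
    """Hypothetical helper: cut every sequence to at most max_length tokens.

    With max_length=None nothing is cut, which mirrors the
    "Default to no truncation" behaviour reported in the warning.
    """
    if max_length is None:
        return [list(ids) for ids in batch]
    return [list(ids[:max_length]) for ids in batch]


# Two toy sequences, one longer than the cap
toy_batch = [list(range(600)), list(range(40))]

uncapped = [len(ids) for ids in truncate_batch(toy_batch, None)]
capped = [len(ids) for ids in truncate_batch(toy_batch, 512)]
print(uncapped)  # [600, 40]
print(capped)    # [512, 40]
```

With the real tokenizer, passing `max_length=512` together with `truncation=True` in the call should give truncation a concrete limit.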


In [18]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

In [19]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

  0%|          | 0/43 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/4 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 42069
    })
    test: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3371
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3372
    })
})

In [20]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [21]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "text"]}
[len(x) for x in samples["input_ids"]]

[123, 69, 49, 68, 25, 77, 47, 51]

In [22]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 123]),
 'token_type_ids': torch.Size([8, 123]),
 'attention_mask': torch.Size([8, 123])}
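The shapes above follow from dynamic padding: every sequence in the batch is padded to the longest one (123 tokens here). A minimal plain-Python sketch of that idea is shown below. It is an illustration only, the helper `pad_to_longest` is hypothetical, and the pad id of 0 is an assumption rather than a value read from this tokenizer.

```python
from typing import Dict, List


def pad_to_longest(input_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Hypothetical helper: pad each sequence to the longest length in the batch."""
    longest = max(len(ids) for ids in input_ids)
    padded, mask = [], []
    for ids in input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": mask}


# Toy sequences with the same lengths as the eight samples above
lengths = [123, 69, 49, 68, 25, 77, 47, 51]
toy_samples = [[5] * n for n in lengths]

padded_batch = pad_to_longest(toy_samples)
shapes = {k: (len(v), len(v[0])) for k, v in padded_batch.items()}
print(shapes)  # {'input_ids': (8, 123), 'attention_mask': (8, 123)}
```

Padding per batch rather than to a global maximum keeps short batches small, which is the reason for using a collator instead of `padding=True` over the whole dataset.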

Training

In [23]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [25]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The error says the config.json file is absent.
# Tried renaming tokenizer_config.json to config.json to see the outcome.
# Renaming the config file did not work.

ValueError: ignored
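A likely reading of this error: `checkpoint` points at a tokenizer directory, while a model's `from_pretrained` expects a model `config.json` (and weights) in that location. Renaming `tokenizer_config.json` cannot supply that, since it describes a tokenizer rather than a model architecture. The `swahili` dataset loaded above also has only a `text` column, so there are no labels for sequence classification. Below is a small sketch of a pre-flight check; the helper `has_model_config` and the file names in the temporary directory are assumptions for illustration.

```python
from pathlib import Path
import tempfile


def has_model_config(checkpoint_dir: str) -> bool:
    """Hypothetical helper: report whether a directory contains a model config.json."""
    return (Path(checkpoint_dir) / "config.json").is_file()


# Simulate a tokenizer-only directory (file names are assumed, typical for a saved fast tokenizer)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    (Path(tmp) / "tokenizer_config.json").write_text("{}")
    found_before = has_model_config(tmp)

    # After a model config is written, the check passes
    (Path(tmp) / "config.json").write_text('{"model_type": "bert"}')
    found_after = has_model_config(tmp)

print(found_before, found_after)  # False True
```

When no pretrained weights exist yet, building the model from a configuration object (as done later in this notebook with `BertConfig`) avoids the need for a checkpoint directory.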

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 42069
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 15777
  Number of trainable parameters = 109483778


ValueError: ignored

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3372
  Batch size = 8


AttributeError: ignored

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

NameError: ignored

# **A Small Test 2**

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("swahili_news")
raw_datasets

Downloading builder script:   0%|          | 0.00/4.62k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.25k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.15k [00:00<?, ?B/s]

Downloading and preparing dataset swahili_news/swahili_news to /root/.cache/huggingface/datasets/swahili_news/swahili_news/0.2.0/ed5c9a13b97e0d2864ff1e34bfbd38b2f2c54fea77acffcaef187eb4f13cf8cc...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data: 0.00B [00:00, ?B/s]

Downloading data: 0.00B [00:00, ?B/s]

  

Extracting data files #0:   0%|          | 0/1 [00:00<?, ?obj/s]

Extracting data files #1:   0%|          | 0/1 [00:00<?, ?obj/s]

Generating train split:   0%|          | 0/22207 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7338 [00:00<?, ? examples/s]

Dataset swahili_news downloaded and prepared to /root/.cache/huggingface/datasets/swahili_news/swahili_news/0.2.0/ed5c9a13b97e0d2864ff1e34bfbd38b2f2c54fea77acffcaef187eb4f13cf8cc. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22207
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7338
    })
})

In [None]:
raw_datasets['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 22207
})

In [None]:
raw_datasets['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['uchumi', 'kitaifa', 'michezo', 'kimataifa', 'burudani', 'afya'], id=None)}

In [None]:
raw_datasets['train'][0]

{'text': ' Bodi ya Utalii Tanzania (TTB) imesema, itafanya misafara ya kutangaza utalii kwenye miji minne nchini China kati ya Juni 19 hadi Juni 26 mwaka huu.Misafara hiyo itatembelea miji ya Beijing Juni 19, Shanghai Juni 21, Nanjig Juni 24 na Changsha Juni 26.Mwenyekiti wa bodi TTB, Jaji Mstaafu Thomas Mihayo ameyasema hayo kwenye mkutano na waandishi wa habari jijini Dar es Salaam.“Tunafanya jitihada kuhakikisha tunavuna watalii wengi zaidi kutoka China hasa tukizingatia umuhimu wa soko la sekta ya utalii nchini,” amesema Jaji Mihayo.Novemba 2018 TTB ilifanya ziara kwenye miji ya Beijing, Shanghai, Chengdu, Guangzhou na Hong Kong kutangaza vivutio vya utalii sanjari kuzitangaza safari za ndege za Air Tanzania.Ziara hiyo inaelezwa kuzaa matunda ikiwa ni pamoja na watalii zaidi ya 300 kuja nchini Mei mwaka huu kutembelea vivutio vya utalii.',
 'label': 0}
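The integer in `label` indexes into the `ClassLabel` names listed above, so label `0` on this first sample corresponds to `uchumi`. A minimal sketch of the mapping in plain Python (the dictionary names `id2label` and `label2id` are illustrative):

```python
# Names copied from the ClassLabel feature shown above, in the same order
label_names = ['uchumi', 'kitaifa', 'michezo', 'kimataifa', 'burudani', 'afya']

# Integer id -> category name, and the reverse lookup
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[0])          # uchumi
print(label2id['michezo'])  # 2
```

The `datasets` library offers the same lookup through the feature itself, for example `raw_datasets['train'].features['label'].int2str(0)`.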

In [None]:
from tqdm.auto import tqdm

In [None]:
text_data = []
file_count = 0

for sample in tqdm(raw_datasets['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        with open(f'/content/drive/MyDrive/Kiswahili_Dataset/Kisw-Model/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, 2,207 samples are left over (22,207 - 20,000); save those too
with open(f'/content/drive/MyDrive/Kiswahili_Dataset/Kisw-Model/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))



  0%|          | 0/22207 [00:00<?, ?it/s]
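The chunking arithmetic for the loop above can be checked with a short sketch. The helper `chunk_sizes` is hypothetical and only counts samples per file; it does not write anything.

```python
from typing import List


def chunk_sizes(num_samples: int, chunk_size: int = 10_000) -> List[int]:
    """Hypothetical helper: number of samples that end up in each output file."""
    full_chunks, leftover = divmod(num_samples, chunk_size)
    return [chunk_size] * full_chunks + ([leftover] if leftover else [])


sizes = chunk_sizes(22_207)
print(sizes)  # [10000, 10000, 2207]
```

With 22,207 training samples this gives two full files (`text_0.txt`, `text_1.txt`) and a third with 2,207 lines. Note that the loop above would write an empty final file if the sample count were an exact multiple of 10,000.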

Initialize my tokenizer

In [None]:
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer')

In [None]:
# test our tokenizer on a simple sentence
tokens = tokenizer('Habari yako')

In [None]:
print(tokens)

{'input_ids': [2, 1265, 808, 3], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}


In [None]:
tokens.input_ids

[2, 1265, 808, 3]

Design a new tokenizer

In [None]:
from pathlib import Path
paths = [str(x) for x in Path('/content/drive/MyDrive/Kiswahili_Dataset/Kisw-Model').glob('**/*.txt')]

In [None]:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

In [None]:
tokenizer.train(files=paths[:5], vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

In [None]:
import os

# create the output directory; exist_ok avoids an error when the cell is re-run
os.makedirs('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2', exist_ok=True)

tokenizer.save_model('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2')

['/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2/vocab.json',
 '/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2/merges.txt']

Initialize new tokenizer

In [None]:
from transformers import RobertaTokenizer

# load the byte-level BPE tokenizer we just trained and saved to file
tokenizer = RobertaTokenizer.from_pretrained('/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2', max_len=512)

In [None]:
# test our tokenizer on a simple sentence
tokens = tokenizer('habari za leo')
print(tokens)

{'input_ids': [0, 4093, 325, 976, 2], 'attention_mask': [1, 1, 1, 1, 1]}


In [None]:
tokens.input_ids

[0, 4093, 325, 976, 2]

Pipeline

In [None]:
with open('/content/drive/MyDrive/Kiswahili_Dataset/Kisw-Model/text_0.txt', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')

In [None]:
batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)
len(batch)

2

In [None]:
import torch

labels = torch.tensor([x.ids for x in batch])
mask = torch.tensor([x.attention_mask for x in batch])

AttributeError: ignored

The same error occurs with the new tokenizer, so the tokenizer itself is not the cause.
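One plausible explanation, consistent with `len(batch)` returning `2`: the tokenizer output behaves like a mapping with the keys `input_ids` and `attention_mask`, so iterating over it yields key strings rather than per-sample encodings, and a string has no `.ids` attribute. The sketch below uses a plain `dict` as a stand-in for the tokenizer output (an assumption), with toy ids where `1` is the `<pad>` id from the special-token order used when training the tokenizer above.

```python
# Plain dict standing in for the tokenizer output (assumption: it behaves like a mapping)
toy_encoding = {
    "input_ids": [[0, 4093, 325, 976, 2], [0, 4093, 2, 1, 1]],
    "attention_mask": [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]],
}

# len() counts keys, not samples, which matches the `2` printed above
num_keys = len(toy_encoding)

# Iterating yields the key names, so `x.ids` would be looked up on a string
iterated = [x for x in toy_encoding]

# Accessing the columns by key gives the per-sample lists directly
input_ids = toy_encoding["input_ids"]
attention_mask = toy_encoding["attention_mask"]

print(num_keys)   # 2
print(iterated)   # ['input_ids', 'attention_mask']
print(len(input_ids), len(attention_mask))  # 2 2
```

Under that assumption, building the tensors from `batch['input_ids']` and `batch['attention_mask']` (for example with `torch.tensor(...)`) would sidestep the error.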

# **Small Test 3**

In [None]:
tokenizer_path = "/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2"
model_path = "/content/drive/MyDrive/Kiswahili_Dataset/My kisw model"

In [None]:
from datasets import load_dataset

# train and test text files generated from the combined scraped data
files = ['/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data1.txt', '/content/drive/MyDrive/Kiswahili_Dataset/Kiswahili_data2.txt']
# load the local text files as a dataset (one example per line)
dataset = load_dataset("text", data_files=files)



  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
dataset = dataset['train']

In [None]:
# split the dataset into training (90%) and testing (10%)
d = dataset.train_test_split(test_size=0.1)

d["train"], d["test"]

(Dataset({
     features: ['text'],
     num_rows: 4
 }), Dataset({
     features: ['text'],
     num_rows: 1
 }))

In [None]:
special_tokens = [
    "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "", ""
]

# if you want to train the tokenizer on both sets
# files = ["train.txt", "test.txt"]

# training the tokenizer on the training set to avoid overfitting
files = ["train.txt"]    

# 30,522 vocab is BERT's default vocab size, feel free to tweak
vocab_size = 30_522

# maximum sequence length; lowering it allows faster training (by making room for a larger batch size)
max_length = 512

# whether to truncate
truncate_longer_samples = True

In [None]:
from tokenizers import BertWordPieceTokenizer

In [None]:
# initialize the WordPiece tokenizer
tokenizer = BertWordPieceTokenizer()

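# train the tokenizer; `train` must be called (not just referenced),
# and every path in `files` must exist on disk
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)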

# enable truncation up to the maximum 512 tokens
tokenizer.enable_truncation(max_length=max_length)

In [None]:
import os
import json

# make the directory if not already there
#if not os.path.isdir(tokenizer_path):
    #os.mkdir(tokenizer_path)

# save the tokenizer
tokenizer.save_model(tokenizer_path)

# dumping some of the tokenizer config to config file,
# including special tokens, whether to lower case and the maximum sequence length
with open(os.path.join(tokenizer_path, "config.json"), "w") as f:
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
        "max_len": max_length,
    }
    json.dump(tokenizer_cfg, f)



In [None]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)
print(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


PreTrainedTokenizerFast(name_or_path='/content/drive/MyDrive/Kiswahili_Dataset/RDK-Kisw-Tokenizer2', vocab_size=0, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


Fill Mask Model Training

In [None]:
def encode_with_truncation(examples):
    """Mapping function to tokenize the sentences passed with truncation"""
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=max_length, return_special_tokens_mask=True)


def encode_without_truncation(examples):
    """Mapping function to tokenize the sentences passed without truncation"""
    return tokenizer(examples["text"], return_special_tokens_mask=True)


# the encode function will depend on the truncate_longer_samples variable
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

# tokenizing the train dataset
train_dataset = d["train"].map(encode, batched=True)

# tokenizing the testing dataset
test_dataset = d["test"].map(encode, batched=True)

if truncate_longer_samples:
    # remove other columns and set input_ids and attention_mask as PyTorch tensors
    train_dataset.set_format(type="torch", columns=[
                             "input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=[
                            "input_ids", "attention_mask"])
else:
    test_dataset.set_format(
        columns=["input_ids", "attention_mask", "special_tokens_mask"])
    train_dataset.set_format(
        columns=["input_ids", "attention_mask", "special_tokens_mask"])
train_dataset, test_dataset

  0%|          | 0/1 [00:00<?, ?ba/s]

Exception: ignored

In [None]:


# Main data processing function that will concatenate all texts from our dataset and generate chunks of
# max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # Split by chunks of max_len.
    result = {
        k: [t[i: i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result


if not truncate_longer_samples:
    train_dataset = train_dataset.map(group_texts, batched=True, batch_size=2_000,
                                      desc=f"Grouping texts in chunks of {max_length}")
    test_dataset = test_dataset.map(group_texts, batched=True, batch_size=2_000,
                                    num_proc=4, desc=f"Grouping texts in chunks of {max_length}")
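To see what `group_texts` does, here is the same logic run on a toy batch with `max_length` set to 4 instead of 512. The toy ids are made up for illustration.

```python
max_length = 4  # toy value for illustration; the notebook uses 512


def group_texts(examples):
    # Concatenate all sequences of each column into one long list
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole chunk
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # Split into chunks of max_length
    return {
        k: [t[i: i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated.items()
    }


# Three toy sequences with 10 tokens in total
toy_examples = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts(toy_examples)
print(grouped)  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}
```

The ten tokens are packed into two full chunks of four, and the last two tokens (9 and 10) are dropped. Sentence boundaries are not preserved, which is the trade-off for having no padding.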



In [None]:
from transformers import BertConfig, BertForMaskedLM


# initialize the model with the config
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
model = BertForMaskedLM(config=model_config)

In [None]:


from transformers import DataCollatorForLanguageModeling

# initialize the data collator, randomly masking 20% (default is 15%) of the tokens for the Masked Language
# Modeling (MLM) task

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
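As an illustration of what masking at `mlm_probability=0.2` means, the sketch below reimplements the commonly described scheme in plain Python: each non-special token is selected with probability 0.2; of the selected tokens, about 80% become the mask token, about 10% a random token, and about 10% stay unchanged; unselected positions get the label `-100` so the loss ignores them. This is a simplified stand-in, not the collator's code, and the special-token ids used here (2, 3, 4) are assumptions.

```python
import random

IGNORE_INDEX = -100  # label value that the loss function skips


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, mlm_probability=0.2, rng=None):
    """Simplified MLM masking sketch; returns (masked_ids, labels)."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never select special tokens; select others with probability mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token
    return masked, labels


# Assumed toy ids: 2 = [CLS], 3 = [SEP], 4 = [MASK]
toy_ids = [2] + list(range(100, 150)) + [3]
masked_ids, mlm_labels = mask_tokens(
    toy_ids, special_ids={2, 3}, mask_id=4, vocab_size=30_522, rng=random.Random(0)
)
num_selected = sum(label != IGNORE_INDEX for label in mlm_labels)
print(num_selected, "of", len(toy_ids) - 2, "tokens selected for prediction")
```

Raising the probability from the default 0.15 to 0.2 gives the model more prediction targets per sequence at the cost of less intact context.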



In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=model_path,          # directory where model checkpoints are saved
    evaluation_strategy="steps",    # evaluate every `logging_steps` steps
    overwrite_output_dir=True,      
    num_train_epochs=10,            # number of training epochs, feel free to tweak
    per_device_train_batch_size=10, # training batch size; set as high as GPU memory allows
    gradient_accumulation_steps=8,  # accumulate gradients over 8 batches before each weight update
    per_device_eval_batch_size=64,  # evaluation batch size
    logging_steps=500,              # evaluate and log every 500 steps
    save_steps=500,                 # save a model checkpoint every 500 steps
    # load_best_model_at_end=True,  # whether to load the best model (in terms of loss) at the end of training
    # save_total_limit=3,           # keep only the 3 most recent checkpoints on disk to save space
)

using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:


from transformers import Trainer

# initialize the trainer and pass everything to it
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


In [None]:
trainer.train() 

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: special_tokens_mask, text. If special_tokens_mask, text are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4
  Num Epochs = 10
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 80
  Gradient Accumulation steps = 8
  Total optimization steps = 10
  Number of trainable parameters = 109514298
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=10, training_loss=1.2318174362182617, metrics={'train_runtime': 311.908, 'train_samples_per_second': 0.128, 'train_steps_per_second': 0.032, 'total_flos': 10528192512000.0, 'train_loss': 1.2318174362182617, 'epoch': 10.0})
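The step counts in the training logs can be reproduced with a little arithmetic. The sketch below is an approximation that matches the numbers logged in this notebook; it is not taken from the Trainer source, and the helper `total_optimization_steps` is hypothetical.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size, grad_accum_steps,
                             num_epochs, num_devices=1):
    """Hypothetical helper: estimate the number of weight updates in a training run."""
    # Batches the dataloader yields per epoch (the last batch may be partial)
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One weight update per `grad_accum_steps` batches, but at least one per epoch
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# Masked-LM run above: 4 examples, batch size 10, 8 accumulation steps, 10 epochs
steps_mlm = total_optimization_steps(4, 10, 8, 10)

# Earlier classification attempt: 42069 examples, batch size 8, no accumulation, 3 epochs
steps_cls = total_optimization_steps(42069, 8, 1, 3)

# Effective batch size per update = per-device batch size * accumulation steps
effective_batch = 10 * 8

print(steps_mlm, steps_cls, effective_batch)  # 10 15777 80
```

With only 4 training examples, each epoch is a single partial batch, so the 10 optimization steps correspond to 10 passes over the same 4 lines of text.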

In [None]:
import os

# load the model checkpoint
model = BertForMaskedLM.from_pretrained(model_path)      # change this to FillMask_Model i.e. change path

# load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)

OSError: ignored