## Malayalam Language Model from Scratch

[How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)

[New Language Model](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=LTXXutqeDzPi)

In [None]:
!pip uninstall -y tensorflow

Found existing installation: tensorflow 2.6.0
Uninstalling tensorflow-2.6.0:
  Successfully uninstalled tensorflow-2.6.0


In [None]:
!pip install -Uqq transformers transformers['sentencepiece'] torch datasets wandb  

[K     |████████████████████████████████| 2.8 MB 28.5 MB/s 
[K     |████████████████████████████████| 264 kB 52.5 MB/s 
[K     |████████████████████████████████| 1.7 MB 38.3 MB/s 
[K     |████████████████████████████████| 3.3 MB 40.1 MB/s 
[K     |████████████████████████████████| 895 kB 40.9 MB/s 
[K     |████████████████████████████████| 50 kB 5.4 MB/s 
[K     |████████████████████████████████| 636 kB 50.0 MB/s 
[K     |████████████████████████████████| 243 kB 61.4 MB/s 
[K     |████████████████████████████████| 119 kB 51.7 MB/s 
[K     |████████████████████████████████| 133 kB 58.3 MB/s 
[K     |████████████████████████████████| 97 kB 7.4 MB/s 
[K     |████████████████████████████████| 170 kB 52.1 MB/s 
[K     |████████████████████████████████| 63 kB 2.0 MB/s 
[K     |████████████████████████████████| 1.1 MB 42.4 MB/s 
[?25h  Building wheel for subprocess32 (setup.py) ... [?25l[?25hdone
  Building wheel for pathtools (setup.py) ... [?25l[?25hdone


In [19]:
!git lfs install

git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
from datasets import load_dataset

# Common Functions

In [None]:
import re
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�Utrnle\_]'
unicode_ignore_regex = r'[\u200e\u200c\u200d]'
english_ignore_regex = r'[a-zA-Z]'

def remove_special_characters(batch):
    batch["text"] = batch["text"].strip()
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"])
    batch["text"] = re.sub(unicode_ignore_regex, '', batch["text"]) + " "
    batch["text"] = re.sub(english_ignore_regex, '', batch["text"]) + " "
    return batch

In [None]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
  with torch.no_grad():
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Tokenization

In [None]:
!wget 'https://calicut.qburst.in/commoncrawl/malayalam/2020-10/malayalam_filtered_html_body.tar.gz'
!tar -xf malayalam_filtered_html_body.tar.gz

In [None]:
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path("/content/malayalam_filtered_html_body").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

In [None]:
!mkdir Malayalam2021BERTo
tokenizer.save_model("Malayalam2021BERTo")

In [None]:
from google.colab import files
files.download("Malayalam2021BERTo/vocab.json")
files.download("Malayalam2021BERTo/merges.txt")
files.download("Malayalam2021BERTo/config.json")

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [None]:
tokenizer.decode(tokenizer.encode("മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ്.").ids)

# Fine Tuning

In [None]:
!nvidia-smi

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

In [None]:
!cp -r /content/drive/MyDrive/'Colab Notebooks'/Hugging_Face/mymodels/Malayalam2021BERTo .

### Data Prep

In [None]:
base_url = 'https://huggingface.co/datasets/rajeshradhakrishnan/malayalam_2020_wiki/resolve/main/'
# dataset = load_dataset('text', data_files={'train': [base_url + '000000_html_body.txt', base_url + '000001_html_body.txt']})
dataset = load_dataset('text', data_files={'train': base_url + '000000_html_body.txt'})

Using custom data configuration default-757d878a7ee1cdfe


Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/text/default-757d878a7ee1cdfe/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...


Downloading:   0%|          | 0.00/323M [00:00<?, ?B/s]

0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-757d878a7ee1cdfe/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.


In [None]:
dataset['train'] = dataset['train'].map(remove_special_characters)

  0%|          | 0/1209542 [00:00<?, ?ex/s]

In [None]:
dataset['train']['text'][0], len(dataset['train'])

('കേരളത്തിൽ വീടുകൾക്ക് സർക്കാർ വക ഇൻഷുറൻസ് വേണം  ', 1209542)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

  0%|          | 0/1210 [00:00<?, ?ba/s]

In [None]:
tokenized_datasets["train"].save_to_disk('/content/drive/MyDrive/Colab Notebooks/Hugging_Face/mymodels/train')

### Pre Load Dataset

In [None]:
from datasets import load_from_disk
train_datasets = load_from_disk('/content/drive/MyDrive/Colab Notebooks/Hugging_Face/mymodels/train')

In [None]:
train_datasets['text'][0], len(train_datasets)

('കേരളത്തിൽ വീടുകൾക്ക് സർക്കാർ വക ഇൻഷുറൻസ് വേണം  ', 1209542)

In [None]:
small_train_dataset = train_datasets.shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /content/drive/MyDrive/Colab Notebooks/Hugging_Face/mymodels/train/cache-7b606e79641bf4c7.arrow


### Setup Model & Tokenizer

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./Malayalam2021BERTo", max_len=512)

In [None]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [None]:
# model.num_parameters()

In [None]:
# config.save_pretrained("./Malayalam2021BERTo") 

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

## Training

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# import wandb
# wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mrajeshmvk[0m (use `wandb login --relogin` to force relogin)


True

In [None]:
# %env WANDB_PROJECT=ml-base

env: WANDB_PROJECT=ml-base


In [None]:
# https://discuss.huggingface.co/t/colab-session-crashing-after-using-all-available-ram/3224

In [None]:
from transformers import Trainer, TrainingArguments
# report_to="wandb",  # enable logging to W&B
# run_name="ml-robertaformaskedlm-lr",  # name of the W&B run (optional)
# Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred
training_args = TrainingArguments(
    output_dir="./Malayalam2021BERTo/checkpoint",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=small_train_dataset,
    compute_metrics=compute_metrics
)

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [14]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: text.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 500
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


Step,Training Loss


Step,Training Loss
500,5.1692




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=500, training_loss=5.16921435546875, metrics={'train_runtime': 5194.349, 'train_samples_per_second': 0.193, 'train_steps_per_second': 0.096, 'total_flos': 132627136512000.0, 'train_loss': 5.16921435546875, 'epoch': 1.0})

In [None]:
# wandb.finish()

In [25]:
trainer.evaluate()

ValueError: ignored

In [None]:
!huggingface-cli login

In [None]:
trainer.save_model("./malayalam-wiki2021-BERTo")
# tokenizer.save_model("./malayalam-wiki2021-BERTo")

In [24]:
from google.colab import files
files.download("malayalam-wiki2021-BERTo/pytorch_model.bin")
files.download("malayalam-wiki2021-BERTo/training_args.bin")
files.download("malayalam-wiki2021-BERTo/config.json")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
model.push_to_hub('malayalam-wiki2021-BERTo')

In [None]:
trainer.push_to_hub(reponame ='malayalam-wiki2021-BERTo')

In [None]:
tokenizer.push_to_hub('malayalam-wiki2021-BERTo')


https://huggingface.co/transformers/model_sharing.html

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

# Downstream Tasks

## Fill Masked word

In [28]:
from transformers import pipeline

# fill_mask = pipeline(
#     "fill-mask",
#     model="./malayalam-wiki2021-BERTo",
#     tokenizer="./Malayalam2021BERTo"
# )

# fill_mask = pipeline(
#     "fill-mask",
#     model="eliasedwin7/MalayalamBERTo",
#     tokenizer="eliasedwin7/MalayalamBERTo"
# )

fill_mask = pipeline(
    "fill-mask",
    model="rajeshradhakrishnan/malayalam-wiki2021-BERTo",
    tokenizer="rajeshradhakrishnan/malayalam-wiki2021-BERTo"
)


https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp4tufjkm8


Downloading:   0%|          | 0.00/671 [00:00<?, ?B/s]

storing https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/838e64c4016f002ea22727d8052795a8044a7c89d8b68989683fb713ce71b37c.713a2f87c7c2c67b2d44643b87e6d7f9704e73aa64edb3f39a009de9fb0a4c5f
creating metadata file for /root/.cache/huggingface/transformers/838e64c4016f002ea22727d8052795a8044a7c89d8b68989683fb713ce71b37c.713a2f87c7c2c67b2d44643b87e6d7f9704e73aa64edb3f39a009de9fb0a4c5f
loading configuration file https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/838e64c4016f002ea22727d8052795a8044a7c89d8b68989683fb713ce71b37c.713a2f87c7c2c67b2d44643b87e6d7f9704e73aa64edb3f39a009de9fb0a4c5f
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": fal

Downloading:   0%|          | 0.00/334M [00:00<?, ?B/s]

storing https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/c2528fab5c6a3c2c003d18f38cc7f25626e87fc4d37e6b6df02f1cf9c4190049.7e11f95cf4373053c18890eb10c45d4900284bc0298cd3cbc4a09e21fe436e24
creating metadata file for /root/.cache/huggingface/transformers/c2528fab5c6a3c2c003d18f38cc7f25626e87fc4d37e6b6df02f1cf9c4190049.7e11f95cf4373053c18890eb10c45d4900284bc0298cd3cbc4a09e21fe436e24
loading weights file https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/c2528fab5c6a3c2c003d18f38cc7f25626e87fc4d37e6b6df02f1cf9c4190049.7e11f95cf4373053c18890eb10c45d4900284bc0298cd3cbc4a09e21fe436e24
All model checkpoint weights were used when initializing RobertaForMaskedLM.

All the weights of RobertaForMaskedLM were initialized from the model checkpoint at rajeshradhakrishnan/malayalam-wiki2021-BERTo.
If your 

Downloading:   0%|          | 0.00/1.50M [00:00<?, ?B/s]

storing https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/7db4dbbb36576d05f9fa3ffa9edf198df7c6bdbffb6ff209a20126af8a319b40.f26d0ed014f9c78408a14bbd010f45d541e8909c03368f1150f83c8095611fb0
creating metadata file for /root/.cache/huggingface/transformers/7db4dbbb36576d05f9fa3ffa9edf198df7c6bdbffb6ff209a20126af8a319b40.f26d0ed014f9c78408a14bbd010f45d541e8909c03368f1150f83c8095611fb0
https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpxqstplef


Downloading:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

storing https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/merges.txt in cache at /root/.cache/huggingface/transformers/cf2b8f9c418ffb00ba471a54da8cd32bf956a3af9da20fd2d44a25a36d76ddb0.fe59d54b1af5b9e7fbdb0fc4840f7e81e2def8290fb4d1bd4a06762c7dbaf558
creating metadata file for /root/.cache/huggingface/transformers/cf2b8f9c418ffb00ba471a54da8cd32bf956a3af9da20fd2d44a25a36d76ddb0.fe59d54b1af5b9e7fbdb0fc4840f7e81e2def8290fb4d1bd4a06762c7dbaf558
loading file https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/7db4dbbb36576d05f9fa3ffa9edf198df7c6bdbffb6ff209a20126af8a319b40.f26d0ed014f9c78408a14bbd010f45d541e8909c03368f1150f83c8095611fb0
loading file https://huggingface.co/rajeshradhakrishnan/malayalam-wiki2021-BERTo/resolve/main/merges.txt from cache at /root/.cache/huggingface/transformers/cf2b8f9c418ffb00ba471a54da8cd32bf956a3af9da20fd2d44a25a36d76ddb0.fe59d54b1af5b9e7

In [27]:
fill_mask("മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ് <mask>.")

[{'score': 0.11835546046495438,
  'sequence': 'മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ്സൺ.',
  'token': 2932,
  'token_str': 'സൺ'},
 {'score': 0.11285492777824402,
  'sequence': 'മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ്ച.',
  'token': 286,
  'token_str': 'ച'},
 {'score': 0.09058760851621628,
  'sequence': 'മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ് തവണ.',
  'token': 1298,
  'token_str': ' തവണ'},
 {'score': 0.038631219416856766,
  'sequence': 'മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ് വർധന.',
  'token': 6168,
  'token_str': ' വർധന'},
 {'score': 0.03196218982338905,
  'sequence': 'മത്സര പ്രതിഫലമായി സ്വന്തമാക്കിയത് പതിനേഴ് ചർച.',
  'token': 1181,
  'token_str': ' ചർച'}]

In [29]:
fill_mask("ത്സര പ്രതിഫലമായി <mask>.")

[{'score': 0.2925995886325836,
  'sequence': 'ത്സര പ്രതിഫലമായി്.',
  'token': 263,
  'token_str': '്'},
 {'score': 0.11167321354150772,
  'sequence': 'ത്സര പ്രതിഫലമായിി.',
  'token': 265,
  'token_str': 'ി'},
 {'score': 0.049256227910518646,
  'sequence': 'ത്സര പ്രതിഫലമായിു.',
  'token': 268,
  'token_str': 'ു'},
 {'score': 0.04570918157696724,
  'sequence': 'ത്സര പ്രതിഫലമായിാ.',
  'token': 269,
  'token_str': 'ാ'},
 {'score': 0.02686399593949318,
  'sequence': 'ത്സര പ്രതിഫലമായിക.',
  'token': 266,
  'token_str': 'ക'}]

## Classification

In [None]:
dataset_cls = load_dataset("rajeshradhakrishnan/malayalam_news")

In [None]:
dataset_cls['train']['text'][:10]