# mBERT Model Monolingual Experiments

## Imports

In [None]:
! pip install transformers datasets --quiet


In [None]:
! sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (885 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 155222 files and directories cur

In [None]:
! transformers-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: vidhur2k
Password: 
ERROR:root:HfApi.login: This method is deprecated in favor of `set_access_token`.
Login successful
Your token: KzAihDRDhKJPpeIYHSuCobGeBLQrKfLphUuNMGfEvFhsgTosGnKOXMRtpMcOjYOwZkKowiOuxfxbgXebInUtEGpAKPkdPcqUFWwktmWphjaYRysxJKigjQmvJUiNWCGm 

Your token has been saved to /root/.huggingface/token


In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, AdamW, get_scheduler
from sklearn.metrics import f1_score, roc_auc_score
import datasets
from datasets import load_dataset, Dataset, load_metric
from tqdm.auto import tqdm

In [None]:
MODEL_NAME = 'bert-base-multilingual-cased'
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f'Using device: {device}')

Using device: cuda


In [None]:
def get_gpu_info():
  gpu_info = !nvidia-smi
  gpu_info = '\n'.join(gpu_info)
  if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
  else:
    print(gpu_info)
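
The `!nvidia-smi` line is IPython shell syntax, so `get_gpu_info` only runs inside a notebook. Below is a minimal sketch of a plain-Python equivalent. The name `gpu_info_text` is hypothetical, and the sketch assumes that a missing `nvidia-smi` binary should be reported the same way as a failed call.

```python
import shutil
import subprocess


def gpu_info_text() -> str:
    """Return the nvidia-smi report, or a fallback message when no GPU is reachable."""
    # Assumption: no nvidia-smi on PATH means no usable GPU runtime.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # Same 'failed' check as the notebook helper, plus a return-code check.
    if result.returncode != 0 or "failed" in result.stdout:
        return "Not connected to a GPU"
    return result.stdout


report = gpu_info_text()
print(report)
```

Returning the text instead of printing it inside the function makes the helper usable from scripts and tests as well.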

## Preprocess the data using the tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

In [None]:
def tokenize_function(example):
    if example['text'] is None:
        # Missing text: tokenize an empty string so the row still yields model inputs.
        return tokenizer('', truncation=True, padding='max_length')
    return tokenizer(example['text'], truncation=True, padding='max_length')
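
Because `tokenize_function` passes `padding='max_length'`, every example is padded to the tokenizer's maximum length at tokenization time, so the `DataCollatorWithPadding` used by the loaders has nothing left to pad. The sketch below compares the number of token positions per dataset under fixed padding and under per-batch padding. The sequence lengths are made up, the helper names are hypothetical, and 512 is assumed as the maximum length for this checkpoint.

```python
MAX_LENGTH = 512  # assumed maximum sequence length for the tokenizer


def padded_tokens_fixed(lengths, max_length=MAX_LENGTH):
    """Token positions when every example is padded to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(lengths, batch_size):
    """Token positions when each batch is padded only to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight short posts.
lengths = [12, 30, 18, 25, 40, 9, 22, 16]
fixed = padded_tokens_fixed(lengths)
dynamic = padded_tokens_dynamic(lengths, batch_size=4)
print(fixed, dynamic)
```

For short texts the fixed scheme processes far more padding than content, which is one reason to consider dropping `padding='max_length'` and letting the collator pad each batch.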

In [None]:
def load_and_tokenize_dataset(csv_file: str):
    # load_dataset('csv', ...) places every row in a single 'train' split.
    dataset = load_dataset('csv', data_files=csv_file)
    print(dataset['train'].column_names)
    # An unnamed index column in the CSV is read as 'Unnamed: 0'.
    has_unnamed_col = 'Unnamed: 0' in dataset['train'].column_names
    if has_unnamed_col:
        dataset = dataset.rename_column('Unnamed: 0', 'idx')
    # Hold out 20% of the rows for evaluation.
    dataset = dataset['train'].train_test_split(test_size=0.2)

    tokenized_datasets = dataset.map(tokenize_function)
    for split in ['train', 'test']:
        # Drop every column the model does not consume.
        drop_cols = ['text', 'token_type_ids']
        if 'id' in tokenized_datasets[split].column_names:
            drop_cols.append('id')
        if has_unnamed_col:
            drop_cols.append('idx')
        tokenized_datasets[split] = tokenized_datasets[split].remove_columns(drop_cols)
        # The model expects the target column to be named 'labels'.
        tokenized_datasets[split] = tokenized_datasets[split].rename_column('hs', 'labels')
        tokenized_datasets[split].set_format('torch')
    return tokenized_datasets

## Define Train and Test Loaders

In [None]:
def get_train_loader(tokenized_dataset: datasets.DatasetDict, batch_size: int):
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_loader = DataLoader(tokenized_dataset['train'], shuffle=True, batch_size=batch_size, collate_fn=data_collator)
    return train_loader

def get_test_loader(tokenized_dataset: datasets.DatasetDict, batch_size: int):
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    # Evaluation metrics do not depend on example order, so the test set is not shuffled.
    test_loader = DataLoader(tokenized_dataset['test'], shuffle=False, batch_size=batch_size, collate_fn=data_collator)
    return test_loader

## Model Training

### Plan

We plan to fine-tune mBERT for classification in two scenarios:

1. **Monolingual-Train Monolingual-Test**: Train on language $X$ and test on the same language $X$.
2. **Multilingual-Train Monolingual-Test**: Train on a set of languages ($X_1, X_2, \dots, X_n$) and test on a language $Y$. We run this scenario both with and without $Y$ in the training set.
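
The two scenarios can be written down as (training languages, test language) pairs. This is only an illustrative sketch: the helper names are hypothetical, and the language list is assumed from the sections of this notebook.

```python
from typing import List, Tuple

# Assumption: the languages that have their own section in this notebook.
LANGUAGES = ["German", "Hindi", "Turkish"]


def monolingual_scenario(lang: str) -> Tuple[List[str], str]:
    """Train on one language and test on that same language."""
    return [lang], lang


def multilingual_scenario(languages: List[str], target: str,
                          include_target: bool) -> Tuple[List[str], str]:
    """Train on several languages and test on `target`, with or without it in training."""
    train = [lang for lang in languages if lang != target]
    if include_target:
        train.append(target)
    return train, target


for lang in LANGUAGES:
    print(monolingual_scenario(lang))
print(multilingual_scenario(LANGUAGES, "Hindi", include_target=False))
print(multilingual_scenario(LANGUAGES, "Hindi", include_target=True))
```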

## Monolingual

In [None]:
# Define training hyperparameters for the monolingual scenario
n_epochs = 5
lr = 5e-5
batch_size = 64  # not read by the functions below; each loader is built with an explicit batch_size=8

def monolingual_train(lang, train_loader):
  print(f'Training mBERT for {lang}')
  model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
  model.to(device)

  optimizer = AdamW(model.parameters(), lr=lr)
  n_training_steps = n_epochs * len(train_loader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=n_training_steps,
  )

  progress = tqdm(range(n_training_steps))
  model.train()

  for epoch in range(n_epochs):
    for batch in train_loader:
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss = outputs.loss
      loss.backward()

      optimizer.step()
      lr_scheduler.step()
      optimizer.zero_grad()
      progress.update(1)

  return model


def monolingual_test(lang, test_loader, model):
  print(f'Evaluating mBERT for {lang}')
  progress = tqdm(range(len(test_loader)))
  accuracy_metric = load_metric("accuracy")
  model.to(device)
  model.eval()
  preds = []
  trues = []
  for batch in test_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
      outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    accuracy_metric.add_batch(predictions=predictions, references=batch["labels"])
    preds.extend(predictions.tolist())
    trues.extend(batch['labels'].tolist())
    progress.update(1)
  
  print(accuracy_metric.compute())
  print(f'F1 Score: {f1_score(trues, preds, average="weighted")}')
  # Note: roc_auc_score receives hard 0/1 predictions here, not class probabilities.
  print(f'AUC Score: {roc_auc_score(trues, preds, average="weighted")}')
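
`monolingual_test` computes the AUC from `argmax` predictions, so the score cannot reflect how confidently the model ranks examples. The sketch below shows the difference on made-up logits, using a plain-Python pairwise AUC instead of `roc_auc_score` to stay self-contained. The helper names and the logits are hypothetical; in the notebook, the analogous change would be to score with the softmax probability of class 1.

```python
import math


def positive_probability(logit_pair):
    """Softmax probability of class 1 for one (logit_0, logit_1) pair."""
    logit_0, logit_1 = logit_pair
    return 1.0 / (1.0 + math.exp(logit_0 - logit_1))


def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical logits: argmax assigns class 0 to every example.
logits = [(2.0, 0.5), (1.5, 1.0), (3.0, -1.0), (1.0, 0.8)]
labels = [0, 1, 0, 1]

hard = [int(l1 > l0) for l0, l1 in logits]        # all zeros
soft = [positive_probability(pair) for pair in logits]

auc_hard = pairwise_auc(labels, hard)
auc_soft = pairwise_auc(labels, soft)
print(auc_hard, auc_soft)
```

Here the hard predictions are all tied, which gives 0.5, while the probabilities still rank both positives above both negatives.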

### German

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/vidhur2k/Multilngual-Hate-Speech/main/data/all-processed/B_german_processed.csv')

In [None]:
df['hs'].value_counts()

0    5526
1    1436
Name: hs, dtype: int64

In [None]:
german_dataset = load_and_tokenize_dataset('https://raw.githubusercontent.com/vidhur2k/Multilngual-Hate-Speech/main/data/all-processed/B_german_processed.csv')

Using custom data configuration default-0889cef5111b695e
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-0889cef5111b695e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/csv/default-0889cef5111b695e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-dd7e0136532a28aa.arrow and /root/.cache/huggingface/datasets/csv/default-0889cef5111b695e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-a08bab189adb2f6e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-0889cef5111b695e/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-1fc0a6932f63ef66.arrow


Raw sample from dataset: {'idx': 3728, 'text': 'standing ovation fur rede frau le pen . wurde mal politiker/in rede halten . respekt ! !', 'hs': 0}


  0%|          | 0/1393 [00:00<?, ?ex/s]

In [None]:
train_loader = get_train_loader(german_dataset, batch_size=8)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.to(device)  # keep the model on the same device as the batches in the training loop

In [None]:
get_gpu_info()

Thu Dec  2 19:36:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    33W / 250W |   1855MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
optimizer = AdamW(model.parameters(), lr=lr)

In [None]:
n_training_steps = n_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=n_training_steps,
)
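
The step count and the learning-rate decay can be checked by hand. The sketch below assumes that the German CSV has 6962 rows (the sum of the two class counts), that 1393 of them form the test split, and that the `"linear"` schedule with zero warmup decays the rate in a straight line from `lr` to zero, which is how we understand it to behave. The helper names are hypothetical.

```python
import math


def total_training_steps(n_train: int, batch_size: int, n_epochs: int) -> int:
    """Optimizer steps when the last, smaller batch of each epoch is kept."""
    return n_epochs * math.ceil(n_train / batch_size)


def linear_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int = 0) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


n_train = 6962 - 1393  # assumed: all rows minus the held-out test rows
steps = total_training_steps(n_train, batch_size=8, n_epochs=5)
print(steps)

# Learning rate at the start, after one epoch (697 steps), and at the end.
for step in (0, 697, steps):
    print(step, linear_lr(step, base_lr=5e-5, total_steps=steps))
```

Under these assumptions the step count equals the 3485 shown by the training progress bar.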

In [None]:
progress = tqdm(range(n_training_steps))
model.train()

for epoch in range(n_epochs):
  for batch in train_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress.update(1)

  0%|          | 0/3485 [00:00<?, ?it/s]

In [None]:
prec_metric = load_metric("precision")
recall_metric = load_metric("recall")

In [None]:
test_loader = get_test_loader(german_dataset, batch_size=8)

In [None]:
model.push_to_hub("vidhur2k/multilingual-hate-speech/mBERT-German-Mono")

Cloning https://huggingface.co/vidhur2k/mBERT-German-Mono into local empty directory.


Upload file pytorch_model.bin:   0%|          | 3.38k/679M [00:00<?, ?B/s]

To https://huggingface.co/vidhur2k/mBERT-German-Mono
   becf86e..a90080b  main -> main



'https://huggingface.co/vidhur2k/mBERT-German-Mono/commit/a90080b2c90c631e2fd6e5212fbba343779a52a6'

In [None]:
german_model = AutoModelForSequenceClassification.from_pretrained("vidhur2k/multilingual-hate-speech/mBERT-German-Mono")
german_model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elemen

In [None]:
monolingual_test("German", test_loader, german_model)

Evaluating mBERT for German


  0%|          | 0/175 [00:00<?, ?it/s]

{'accuracy': 0.7932519741564967}
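
Accuracy on this dataset needs a baseline to be meaningful. With 5526 rows labeled 0 and 1436 labeled 1, a classifier that always predicts class 0 is already right on most rows. The sketch below computes that baseline; it assumes the class counts printed for the German CSV apply here and that the random test split has roughly the same class shares. The function name is hypothetical.

```python
def majority_baseline_accuracy(class_counts: dict) -> float:
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())


german_counts = {0: 5526, 1: 1436}   # from df['hs'].value_counts()
reported_accuracy = 0.7932519741564967

baseline = majority_baseline_accuracy(german_counts)
print(f'Majority baseline: {baseline:.4f}')
print(f'Reported accuracy: {reported_accuracy:.4f}')
print(f'Gain over baseline: {reported_accuracy - baseline:+.4f}')
```

A gain close to zero suggests looking at per-class metrics before drawing conclusions from the accuracy figure.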


### Hindi

In [None]:
hindi_dataset = load_and_tokenize_dataset('https://raw.githubusercontent.com/vidhur2k/Multilngual-Hate-Speech/main/data/all-processed/B_hindi_processed.csv')

Using custom data configuration default-57b3179ba75ad551
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-57b3179ba75ad551/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/12000 [00:00<?, ?ex/s]

  0%|          | 0/3000 [00:00<?, ?ex/s]

In [None]:
train_loader = get_train_loader(hindi_dataset, batch_size=8)

In [None]:
hindi_model = monolingual_train("Hindi", train_loader)

Training mBERT for Hindi


Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

  0%|          | 0/7500 [00:00<?, ?it/s]

In [None]:
test_loader = get_test_loader(hindi_dataset, batch_size=8)

In [None]:
monolingual_test("Hindi", test_loader, hindi_model)

Evaluating mBERT for Hindi


  0%|          | 0/375 [00:00<?, ?it/s]

{'accuracy': 0.8163333333333334}
F1 Score: 0.7337861381293204
AUC Score: 0.5
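
An AUC of exactly 0.5 computed from hard predictions is what a model that assigns the same class to every example would produce. The check below is a sketch, not a diagnosis: it computes the weighted F1 that such a constant predictor would get on a test set whose majority-class share equals the reported accuracy. The function name and the constant-predictor assumption are ours.

```python
def constant_predictor_weighted_f1(majority_share: float) -> float:
    """Weighted F1 when every example is predicted as the majority class.

    Majority class: precision = p, recall = 1, so F1 = 2p / (1 + p).
    Minority class: never predicted, so its F1 counts as 0.
    Weighting by class share gives p * 2p / (1 + p).
    """
    p = majority_share
    return p * (2 * p / (1 + p))


# Reported figures from the Hindi and Turkish evaluation outputs.
hindi_accuracy, hindi_f1 = 0.8163333333333334, 0.7337861381293204
turkish_accuracy, turkish_f1 = 0.8121856588590315, 0.7280109973628176

hindi_expected = constant_predictor_weighted_f1(hindi_accuracy)
turkish_expected = constant_predictor_weighted_f1(turkish_accuracy)
print(hindi_expected, hindi_f1)
print(turkish_expected, turkish_f1)
```

The computed values agree with the reported F1 scores for both runs, which is consistent with the models predicting only the majority class. Inspecting a confusion matrix would confirm this.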


In [None]:
hindi_model.push_to_hub("vidhur2k/multilingual-hate-speech/mBERT-Hindi-Mono")

Cloning https://huggingface.co/vidhur2k/mBERT-Hindi-Mono into local empty directory.


Upload file pytorch_model.bin:   0%|          | 3.39k/679M [00:00<?, ?B/s]

To https://huggingface.co/vidhur2k/mBERT-Hindi-Mono
   ed50b97..5230fc3  main -> main



'https://huggingface.co/vidhur2k/mBERT-Hindi-Mono/commit/5230fc32f14bb9522c23d68c5e845da67f755d0a'

### Turkish

In [None]:
turkish_dataset = load_and_tokenize_dataset('https://raw.githubusercontent.com/vidhur2k/Multilngual-Hate-Speech/main/data/all-processed/B_turkish_processed.csv')

Using custom data configuration default-c8e15b8dbe9cd6f7
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-c8e15b8dbe9cd6f7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/csv/default-c8e15b8dbe9cd6f7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-30669d71210516ec.arrow and /root/.cache/huggingface/datasets/csv/default-c8e15b8dbe9cd6f7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-e59662c9a3a0d98a.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-c8e15b8dbe9cd6f7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-7c4a27da4bd67182.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-c8e15b8dbe9cd6f7/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-c8b1a748f31da5ca.arrow


['id', 'text', 'hs']


In [None]:
train_loader = get_train_loader(turkish_dataset, batch_size=8)

In [None]:
turkish_model = monolingual_train("Turkish", train_loader)

Training mBERT for Turkish


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

  0%|          | 0/17400 [00:00<?, ?it/s]

In [None]:
test_loader = get_test_loader(turkish_dataset, batch_size=8)

In [None]:
monolingual_test("Turkish", test_loader, turkish_model)

Evaluating mBERT for Turkish


  0%|          | 0/870 [00:00<?, ?it/s]

{'accuracy': 0.8121856588590315}
F1 Score: 0.7280109973628176
AUC Score: 0.5


In [None]:
turkish_model.push_to_hub("vidhur2k/multilingual-hate-speech/mBERT-Turkish-Mono")