# Text Classification with ParsBigBird



## What's ParsBigBird?

A long-range BERT for self-supervised learning of language representations for the Persian language.


BERT and ParsBERT can handle texts of at most 512 tokens, but many tasks, such as summarization and question answering, require longer inputs. In our work, we trained the BigBird model for the Persian (Farsi) language with sparse attention, so it can process texts of up to 4096 tokens.
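
To make the benefit of sparse attention concrete, here is a minimal token-level sketch in plain Python of the pattern BigBird combines: a sliding window, a few global tokens, and a few random keys per query. It is an illustration only. The function name and the `window`, `num_global`, and `num_random` values are assumptions chosen for this example, and real implementations operate on blocks of tokens rather than single tokens.

```python
import random


def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Return the set of (query, key) pairs that are allowed to attend."""
    rng = random.Random(seed)
    allowed = set()
    for q in range(seq_len):
        # sliding window: each token sees its close neighbours (and itself)
        for k in range(max(0, q - window), min(seq_len, q + window + 1)):
            allowed.add((q, k))
        # global tokens: the first `num_global` tokens see everyone
        # and are seen by everyone
        for g in range(min(num_global, seq_len)):
            allowed.add((q, g))
            allowed.add((g, q))
        # a few randomly chosen keys per query
        for k in rng.sample(range(seq_len), min(num_random, seq_len)):
            allowed.add((q, k))
    return allowed


seq_len = 4096
sparse_pairs = len(sparse_attention_mask(seq_len))
full_pairs = seq_len * seq_len
print(f"full attention pairs:   {full_pairs}")
print(f"sparse attention pairs: {sparse_pairs}")
```

With these settings the number of allowed pairs grows linearly with the sequence length (at most 13 per token here), whereas full attention grows quadratically. That is what makes 4096-token inputs affordable.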

# Dataset (DigiMag)
 - For this notebook, I'm going to use the [DigiMag dataset](https://drive.google.com/file/d/1YgrCYY-Z0h2z0-PfWVfOGt1Tv0JDI-qz/view) for text classification
    - Train size: **6865**, validation size: **767**, test size: **852**
    - It covers **7** magazine categories (7 classes)
    - Thanks to **Hooshvare** for sharing it

- This is only an example of usage; you can use whatever dataset you have

In [2]:
!nvidia-smi

Thu Oct 28 07:47:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# Install required packages
!pip -q install datasets
!pip -q install transformers
!pip -q install sentencepiece
!pip -q install hazm
!pip -q install clean-text[gpl]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.3 which is incompatible.


In [4]:
# Import required packages
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm 

from hazm import Normalizer
from hazm import WordTokenizer
from cleantext import clean
import re

## Loading
  - The dataset is [here](https://bit.ly/3ca4bm8) on Google Drive
  - Add it to your Drive and use it as follows

In [6]:
# mount Google Drive and change the current directory
from google.colab import drive
drive.mount('/gdrive')  # the mount point must be a directory, not the zip file
%cd /gdrive/MyDrive

In [7]:
# unzip the dataset in your drive
!unzip ./data.zip

Archive:  ./data.zip
   creating: digimag/
  inflating: digimag/dev.csv         
  inflating: digimag/train.csv       
  inflating: digimag/test.csv        


- Load the train, validation, and test splits with pandas

In [8]:
train_df = pd.read_csv('digimag/train.csv', delimiter="	", index_col=False)
eval_df = pd.read_csv('digimag/dev.csv', delimiter="	", index_col=False)
test_df = pd.read_csv('digimag/test.csv', delimiter="	", index_col=False)
# drop the columns we don't need; the numeric label_id column is kept
train_df.drop(columns=['Unnamed: 0', 'label'], inplace=True)
eval_df.drop(columns=['Unnamed: 0', 'label'], inplace=True)
test_df.drop(columns=['Unnamed: 0', 'label'], inplace=True)
train_df.head()

Unnamed: 0,content,label_id
0,نمایش تبلیغ در لاک‌اسکرین تعدادی از گوشی‌های ه...,3
1,شکست Justice League در باکس آفیس پس از بازخورد...,5
2,کلاسیک بینی؛ همه چیز در یک شب اتفاق افتاد فیلم...,5
3,اپل دوباره سراغ رنده رفته چراکه آپگرید کردن سط...,3
4,بررسی جزء به جزء بهترین بخش Ori and the Blind ...,0


## Normalization (Preprocessing)


In [9]:
# for more details see hazm doc
tokenizer = WordTokenizer()
normalizer = Normalizer()

def cleaning(x):
    # clean html, css, js: remove <style>/<script> blocks (with their contents)
    # first, then strip any remaining tags
    x = re.compile(
        r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.S | re.I).sub('', x)
    x = re.compile(
        r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.S | re.I).sub('', x)
    x = re.sub(r'<.*?>', '', x)
    # # regular cleaning
    x = clean(x,
              fix_unicode=True,
              to_ascii=False,
              lower=True,
              no_line_breaks=True,
              no_urls=True,
              no_emails=True,
              no_phone_numbers=True,
              no_numbers=False,
              no_digits=False,
              no_currency_symbols=True,
              no_punct=False,
              replace_with_url=" ",
              replace_with_email=" ",
              replace_with_phone_number=" ",
              replace_with_number=" ",
              replace_with_digit="",
              replace_with_currency_symbol=" ",
              )

    x = normalizer.normalize(x)
    # remove weird patterns (emoji, pictographs, directional marks)
    wierd_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u'\U00010000-\U0010ffff'
                               u"\u200d"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\u3030"
                               u"\ufe0f"
                               u"\u2069"
                               u"\u2066"
                               u"\u2068"
                               u"\u2067"
                               # u"\u200c"  # half space
                               "]+", flags=re.UNICODE)

    x = wierd_pattern.sub(r'', x)
    # add space between numbers
    x = re.sub(r'(\d+(\.\d+)?)', r' \1 ', x)
    # remove #
    x = re.sub("#", " ", x)
    # remove @
    x = re.sub("@", " ", x)
    # collapse a character repeated 4 or more times: bbbb+ -> b
    x = re.sub(r'([آ-ی])\1{3,}', r'\1', x)
    x = re.sub(r'([.])\1{3,}', r'\1', x)
    # remove extra space
    x = re.sub(r"\s+", " ", x)
    return x.strip()

def text_preprocessor(t, tokenize=False):
    tokens = tokenizer.tokenize(cleaning(normalizer.normalize(t)))
    return tokens if tokenize else ' '.join(tokens)
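
To see what the final regex rules of `cleaning` do in isolation, the sketch below applies only those rules to two sample strings, without hazm or clean-text. The helper name `regex_rules_only` and the sample strings are made up for this illustration. The output of the full pipeline will differ, because hazm normalization and tokenization also change the text.

```python
import re


def regex_rules_only(x):
    # add space between numbers and the surrounding text
    x = re.sub(r'(\d+(\.\d+)?)', r' \1 ', x)
    # replace # and @ with spaces
    x = re.sub("#", " ", x)
    x = re.sub("@", " ", x)
    # collapse a Persian letter or a dot repeated 4 or more times
    x = re.sub(r'([آ-ی])\1{3,}', r'\1', x)
    x = re.sub(r'([.])\1{3,}', r'\1', x)
    # collapse runs of whitespace
    x = re.sub(r"\s+", " ", x)
    return x.strip()


print(regex_rules_only("قیمت#گوشی 12.5دلار....."))
print(regex_rules_only("عااااالی @user"))
```

The first sample shows the number being separated from the word next to it and the run of dots being collapsed. The second shows a stretched letter being reduced to a single occurrence.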

In [10]:
# clean test set and valid set
test_df['cleaned'] = test_df.content.apply(text_preprocessor)
print('test cleaned')
eval_df['cleaned'] = eval_df.content.apply(text_preprocessor)
print('valid cleaned')

test cleaned
valid cleaned


In [11]:
# clean and tokenize train set
train_df['cleaned'] = train_df.content.apply(text_preprocessor)
print('train cleaned')
train_df['tokens'] = train_df.content.apply(text_preprocessor, tokenize=True)
train_df.head(3)

train cleaned


Unnamed: 0,content,label_id,cleaned,tokens
0,نمایش تبلیغ در لاک‌اسکرین تعدادی از گوشی‌های ه...,3,نمایش تبلیغ در لاک‌اسکرین تعدادی از گوشی‌های ه...,"[نمایش, تبلیغ, در, لاک‌اسکرین, تعدادی, از, گوش..."
1,شکست Justice League در باکس آفیس پس از بازخورد...,5,شکست justice league در باکس آفیس پس از بازخورد...,"[شکست, justice, league, در, باکس, آفیس, پس, از..."
2,کلاسیک بینی؛ همه چیز در یک شب اتفاق افتاد فیلم...,5,کلاسیک بینی ؛ همه چیز در یک شب اتفاق افتاد فیل...,"[کلاسیک, بینی, ؛, همه, چیز, در, یک, شب, اتفاق,..."


### Finding the max len

In [12]:
# compute the number of words in each article
train_df['words_len'] = train_df.tokens.apply(len)

In [13]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=train_df['words_len']))

fig.update_layout(
    title_text='Distribution of word counts within articles',
    xaxis_title_text='Word Count',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

In [14]:
minlim, maxlim = 30, 2048
# keep only articles with more than 30 and at most 2048 words
print('size of the data before remove: ', len(train_df))
train_df['words_len'] = train_df['words_len'].apply(lambda len_t: len_t if minlim < len_t <= maxlim else None)
train_df = train_df.dropna(subset=['words_len'])
train_df = train_df.reset_index(drop=True)
print('size of the data after remove: ', len(train_df))

size of the data before remove:  6896
size of the data after remove:  6588
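
The filter above keeps a row only when `minlim < words_len <= maxlim`. As a small sketch of that rule on plain Python lists (the helper `keep_by_length` is a name invented for this example, not part of the notebook):

```python
def keep_by_length(token_lists, minlim=30, maxlim=2048):
    """Keep token lists whose length is in the half-open range (minlim, maxlim]."""
    return [tokens for tokens in token_lists if minlim < len(tokens) <= maxlim]


docs = [["w"] * n for n in (30, 31, 2048, 2049)]
kept_lengths = [len(d) for d in keep_by_length(docs)]
print(kept_lengths)
```

Note that the lower bound is strict: an article with exactly 30 words is dropped, while one with exactly 2048 words is kept.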


### Distribution of Classes

In [15]:
fig = go.Figure()
topic_freq = train_df.label_id.value_counts()
fig.add_trace(go.Bar(y=topic_freq, x=topic_freq.index.to_numpy()))
fig.update_layout(
    title_text='Distribution of each class',
    xaxis_title_text='classes',
    yaxis_title_text='Frequency',
    bargap=0.2,
    bargroupgap=0.2)

fig.show()

## Prepare data for Text Classification
- We use the transformers tokenizer that ships with the model
- The tokenizer removes half-spaces and newlines itself, so you don't have to handle them in your cleaning

In [16]:
from datasets import load_metric
from transformers import AutoTokenizer

In [17]:
lr = 2e-5 
weight_decay = 0.001
num_epochs = 2
batch_size = 4
accumulation_steps = 4
num_workers = 2
max_len = 2048 # much bigger than bert :) 
class_number = 7
metric = load_metric('f1')
model_name = "SajjadAyoubi/distil-bigbird-fa-zwnj"

In [18]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

- Tokenization

In [19]:
train_encodings = tokenizer(train_df.cleaned.tolist(), max_length=max_len, truncation=True, padding=True)
print('train tokenized')
test_encodings = tokenizer(test_df.cleaned.tolist(), max_length=max_len, truncation=True, padding=True)
print('test tokenized')
valid_encodings = tokenizer(eval_df.cleaned.tolist(), max_length=max_len, truncation=True, padding=True)
print('valid tokenized')

train tokenized
test tokenized
valid tokenized
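
In the tokenizer calls above, `truncation=True` with `max_length=max_len` cuts every sequence to at most `max_len` tokens, and `padding=True` pads each sequence to the longest one in the same call. The following plain-Python sketch mimics that behaviour on lists of token ids. It is a simplification under stated assumptions: the function name and `pad_id=0` are hypothetical, and special tokens, which the real tokenizer preserves when truncating, are ignored.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate([[5, 6, 7, 8, 9], [1, 2]], max_length=4)
print(encoded["input_ids"])       # [[5, 6, 7, 8], [1, 2, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because padding goes to the longest sequence in the call, a single very long article makes every row in that split as long as it is, up to `max_len`.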


- Create config for our classification task

In [20]:
from transformers import AutoConfig
# create a dict for classes
label2id = {label: i for i, label in enumerate(range(class_number))}
id2label = {v: k for k, v in label2id.items()}
config = AutoConfig.from_pretrained(model_name, **{'label2id': label2id, 'id2label': id2label})
print(f'label2id: {label2id}')
print(f'id2label: {id2label}')

label2id: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
id2label: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
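
Since the labels here are already integers, both mappings are the identity. If your own dataset has named classes, the same two dictionaries can be built from the names. The class names below are placeholders invented for this sketch, not the DigiMag categories.

```python
# hypothetical class names, used only for illustration
class_names = ["class_a", "class_b", "class_c"]

label2id = {name: i for i, name in enumerate(class_names)}
id2label = {i: name for name, i in label2id.items()}

print(label2id)  # {'class_a': 0, 'class_b': 1, 'class_c': 2}
print(id2label)  # {0: 'class_a', 1: 'class_b', 2: 'class_c'}
```

Passing such mappings to the config lets the model report readable class names instead of bare indices.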


# Training

In [21]:
import torch
from torch import optim
from torch.utils.data import DataLoader
from transformers import BigBirdForSequenceClassification
from transformers import default_data_collator

In [22]:
class DigiMagDs(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# create datasets
train_ds = DigiMagDs(train_encodings, train_df.label_id)
valid_ds = DigiMagDs(valid_encodings, eval_df.label_id)
test_ds = DigiMagDs(test_encodings, test_df.label_id)

# create DataLoaders (only the training set needs shuffling)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, collate_fn=default_data_collator)

valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False,
                      num_workers=num_workers, collate_fn=default_data_collator)

test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False,
                     num_workers=num_workers, collate_fn=default_data_collator)

## PyTorch

In [24]:
import time, sys
from IPython.display import clear_output

class TorchTrainer:
    def __init__(self, model, train_dl, valid_dl, optimizer, device, accumulation_steps, compute_metric):
        self.model = model.to(device)
        self.device = device
        self.train_dl = train_dl
        self.valid_dl = valid_dl
        self.optimizer = optimizer
        self.compute_metric = compute_metric
        self.accumulation_steps = accumulation_steps
        self.loss_history = []

    def fit(self, num_epochs):
        clear_output()
        valid_acc = 0
        for epoch in range(num_epochs):
            print('Epoch %2d/%2d' % (epoch + 1, num_epochs))
            print('-' * 20)
            t0 = time.time()
            train_acc = self.train_model()
            valid_acc = self.valid_model()
            time_elapsed = time.time() - t0
            print('\n  Metrics: | train_met: %.3f | valid_met: %.3f |' % (train_acc[0], valid_acc[0]))
            print('\n  Epoch complete in: %.0fm %.0fs \n' % (time_elapsed // 60, time_elapsed % 60))
        return

    def train_model(self):
        self.model.train()
        N = len(self.train_dl.dataset)
        steps = N // (self.train_dl.batch_size*self.accumulation_steps) 
        step = 0
        avg_metric = 0.0
        for i, batch in enumerate(self.train_dl):
            input_ids = batch['input_ids'].to(self.device)
            attention_mask = batch['attention_mask'].to(self.device)
            labels = batch['labels'].to(self.device)
            outputs = self.model(input_ids, attention_mask=attention_mask, labels=labels)
            predictions = torch.argmax(outputs['logits'], dim=1)
            loss = outputs['loss'] 
            loss /= self.accumulation_steps   
            loss.backward()
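            # weights update: every `accumulation_steps` batches, and on the
            # last batch so leftover gradients are not dropped
            if ((i+1) % self.accumulation_steps == 0) or ((i+1) == len(self.train_dl)):
                self.optimizer.step()
                self.optimizer.zero_grad()

                avg_metric += self.compute_metric(predictions, labels)
                # store a plain float so the autograd graph is not kept alive
                self.loss_history.append(loss.item())
                sys.stdout.flush()
                sys.stdout.write("\r  Train_Step: %d/%d | runing_loss: %.4f" % (step, steps, loss.item()))
                step += 1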

        sys.stdout.flush()
        return torch.tensor([avg_metric]) / step

    def valid_model(self):
        print()
        self.model.eval()
        N = len(self.valid_dl.dataset)
        steps = N // self.valid_dl.batch_size
        avg_metric = 0.0
        with torch.no_grad():
            for i, batch in enumerate(self.valid_dl):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)
                outputs = self.model(input_ids, attention_mask=attention_mask, labels=labels)
                predictions = torch.argmax(outputs['logits'], dim=1)
                loss = outputs['loss']
                avg_metric += self.compute_metric(predictions, labels)
                sys.stdout.flush()
                sys.stdout.write("\r  Valid_Step: %d/%d | runing_loss: %.4f" % (i, steps, loss))

        sys.stdout.flush()
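        # average over the actual number of batches
        # (N // batch_size misses a final partial batch)
        return torch.tensor([avg_metric]) / len(self.valid_dl)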
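
The trainer divides each loss by `accumulation_steps` before calling `backward()`. The reason is that gradients add up across micro-batches, so scaling each one makes the accumulated gradient match that of one large batch. Here is a minimal sketch of that equivalence in plain Python, using a one-parameter linear model with mean squared error. The data and the helper `grad_mse` are invented for this example, and the equality holds when all micro-batches have the same size.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of mean((w * x - y) ** 2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# gradient of one large batch
full_grad = grad_mse(w, data)

# same data as two micro-batches, each gradient scaled by 1 / accum_steps
accum_steps = 2
micro_batches = [data[0:2], data[2:4]]
accumulated_grad = sum(grad_mse(w, mb) / accum_steps for mb in micro_batches)

print(full_grad, accumulated_grad)
```

Both values agree up to floating-point rounding, which is why a batch size of 4 with 4 accumulation steps behaves like a batch of 16 while needing memory for only 4 samples at a time.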

In [26]:
model = BigBirdForSequenceClassification.from_pretrained(model_name, config=config)
# model.gradient_checkpointing_enable() # for avoiding OOM

optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Some weights of the model checkpoint at SajjadAyoubi/distil-bigbert-uncased were not used when initializing BigBirdForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BigBirdForSequenceClassification were not initialized from the model checkpoint 

In [27]:
def compute_metric(predictions, labels):
    return metric.compute(predictions=predictions, 
                          references=labels, average="weighted")['f1']
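
For intuition about what `average="weighted"` reports, the sketch below computes a weighted F1 score by hand: the per-class F1 scores are averaged, each weighted by the number of true samples of that class. It follows the usual definition and is not the library's implementation; the function name `weighted_f1` is chosen for this example.

```python
from collections import Counter


def weighted_f1(predictions, references):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(references)
    total = len(references)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(predictions, references))
        tp = sum(p == cls and r == cls for p, r in pairs)
        fp = sum(p == cls and r != cls for p, r in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


refs = [0, 0, 1, 1]
preds = [0, 1, 1, 1]
print(round(weighted_f1(preds, refs), 4))  # 0.7333
```

In this toy case class 0 has F1 = 2/3 and class 1 has F1 = 0.8. Both classes have two true samples, so the weighted score is their plain mean. With imbalanced classes, the larger classes dominate the score.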

In [28]:
trainer = TorchTrainer(model, train_dl, valid_dl, optimizer=optimizer, 
                       device=device, accumulation_steps=accumulation_steps, 
                       compute_metric=compute_metric)

trainer.fit(num_epochs=num_epochs)

Epoch  1/ 2
--------------------



floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /usr/local/src/pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)



  Train_Step: 411/411 | runing_loss: 0.0206
  Valid_Step: 191/191 | runing_loss: 0.0067
  Metrics: | train_met: 0.886 | valid_met: 0.952 |

  Epoch complete in: 25m 55s 

Epoch  2/ 2
--------------------
  Train_Step: 411/411 | runing_loss: 0.0030
  Valid_Step: 191/191 | runing_loss: 0.0605
  Metrics: | train_met: 0.950 | valid_met: 0.959 |

  Epoch complete in: 25m 58s 
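
The `411` in the training log follows from the numbers used earlier: 6588 training samples after filtering, `batch_size = 4`, and `accumulation_steps = 4`. A small sketch of that arithmetic (the helper name is made up for this example):

```python
import math


def steps_per_epoch(num_samples, batch_size, accumulation_steps):
    """Return (batches, full optimizer updates, leftover batches) per epoch."""
    num_batches = math.ceil(num_samples / batch_size)
    full_updates = num_batches // accumulation_steps
    leftover_batches = num_batches % accumulation_steps
    return num_batches, full_updates, leftover_batches


print(steps_per_epoch(6588, 4, 4))  # (1647, 411, 3)
```

Each epoch therefore has 1647 batches and 411 full optimizer updates with an effective batch size of 16, plus 3 leftover batches at the end. Two epochs give 2 × 411 = 822 full updates.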



## HuggingFace 🤗
- You can also train the model with the HuggingFace `Trainer` ([more details](https://huggingface.co/transformers/main_classes/trainer.html))

In [30]:
from transformers import TrainingArguments, Trainer

In [31]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")

In [33]:
model = BigBirdForSequenceClassification.from_pretrained(model_name, config=config)

In [34]:
args = TrainingArguments(
    "ParsBigBirdDigiMag",
    evaluation_strategy="epoch",
    logging_strategy='epoch',
    save_strategy="no",
    learning_rate=lr,
    label_smoothing_factor=0.05,
    weight_decay=weight_decay,
    fp16=True,
    dataloader_num_workers=num_workers,
    gradient_accumulation_steps=accumulation_steps,
    gradient_checkpointing=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    report_to='none',
    )

In [35]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics
    )

In [36]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1,Runtime,Samples Per Second
0,0.5491,0.404824,0.94136,125.6715,6.103
1,0.3877,0.384333,0.95018,124.5989,6.156


TrainOutput(global_step=822, training_loss=0.46839184540611695, metrics={'train_runtime': 4203.0943, 'train_samples_per_second': 0.196, 'total_flos': 1.2798163811155968e+16, 'epoch': 2.0, 'init_mem_cpu_alloc_delta': -316198912, 'init_mem_gpu_alloc_delta': 318293504, 'init_mem_cpu_peaked_delta': 316198912, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 7413760, 'train_mem_gpu_alloc_delta': 965915648, 'train_mem_cpu_peaked_delta': 8192, 'train_mem_gpu_peaked_delta': 6561998336})

In [37]:
trainer.evaluate(test_ds)

{'eval_loss': 0.4253820478916168,
 'eval_f1': 0.9399713635714451,
 'eval_runtime': 138.3159,
 'eval_samples_per_second': 6.16,
 'epoch': 2.0,
 'eval_mem_cpu_alloc_delta': 94208,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_cpu_peaked_delta': 24576,
 'eval_mem_gpu_peaked_delta': 931051520}