# Making things BERT friendly

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
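As a rough illustration of the "corrupt, then reconstruct" idea, the sketch below masks random tokens in a sentence and records what the model would have to recover. The function name `mask_tokens`, the whitespace tokenization, and the `mask_prob` value are assumptions made for this example; BERT's actual masking procedure is more elaborate.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to recover.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets[position] = token
        else:
            corrupted.append(token)
    return corrupted, targets

sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3)
print(corrupted)  # same length as the input, some tokens replaced by [MASK]
print(targets)    # masked positions and the tokens to reconstruct
```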

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a pretrained NLP model developed by Google.

"Transformers are a multi-head attention mechanisms that learns contextual relations between the words in the given text.
Generally, a transformer consists of two separate parts - an encoder that accepts the text input and an optional
decoder or a sigmoid/softmax layer that produces a prediction for the task. BERT is a pre-training approach that uses
this architecture for modeling." [Toxic Comment Classification using Transformers](https://www.ieomsociety.org/singapore2021/papers/366.pdf)


The tokenizer returns a dictionary with three important items:

* input_ids are the indices corresponding to each token in the sentence.
* attention_mask indicates whether a token should be attended to or not.
* token_type_ids identifies which sequence a token belongs to when there is more than one sequence.
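To make the three items concrete, here is a toy sketch of how they line up for a pair of sequences. The vocabulary and the helper `encode_pair` are made up for this example; a real BERT tokenizer uses WordPiece and a vocabulary of roughly 30,000 entries.

```python
def encode_pair(tokens_a, tokens_b, vocab):
    """Toy sentence-pair encoding laid out as [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    first_segment_len = len(tokens_a) + 2  # [CLS] + A + first [SEP]
    return {
        "input_ids": [vocab[t] for t in tokens],
        "attention_mask": [1] * len(tokens),  # no padding here, so attend to everything
        "token_type_ids": [0] * first_segment_len
                          + [1] * (len(tokens) - first_segment_len),
    }

# Hypothetical toy vocabulary
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "is": 3, "it": 4, "toxic": 5, "no": 6}

encoded = encode_pair(["is", "it", "toxic"], ["no"], toy_vocab)
print(encoded["input_ids"])       # [1, 3, 4, 5, 2, 6, 2]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1]
```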

Special thanks to HARSH JAIN for their Kaggle article [BERT for "Everyone"](https://www.kaggle.com/code/harshjain123/bert-for-everyone-tutorial-implementation), from which much of the following code is excerpted.


First, let's make the data compliant with BERT. The Tokenizer class provides a very helpful function, encode_plus, which performs the following operations:

* Tokenize the text
* Add special tokens - [CLS] and [SEP]
* Create token IDs
* Pad the sentences to a common length
* Create attention masks for the above PAD tokens
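The steps above can be sketched in plain Python. This is not the library's implementation: the whitespace tokenizer, the toy vocabulary, and the helper name `encode_single` are assumptions chosen to keep the example self-contained.

```python
def encode_single(text, vocab, max_len):
    """Toy version of the tokenize / special tokens / ids / pad / mask steps."""
    tokens = text.lower().split()              # 1. tokenize (whitespace only)
    tokens = tokens[: max_len - 2]             #    truncate, leaving room for 2 special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]    # 2. add special tokens
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]  # 3. token IDs
    pad_count = max_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad_count     # 5. mask out PAD positions
    input_ids = input_ids + [vocab["[PAD]"]] * pad_count        # 4. pad to a common length
    return input_ids, attention_mask

# Hypothetical toy vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "you": 4, "are": 5, "great": 6}

ids, mask = encode_single("You are great", vocab, max_len=8)
print(ids)   # [2, 4, 5, 6, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# A sentence longer than max_len is truncated; unknown words map to [UNK]
long_ids, long_mask = encode_single("you are really great friend", vocab, max_len=5)
print(long_ids)   # [2, 4, 5, 1, 3]
print(long_mask)  # [1, 1, 1, 1, 1]
```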



We'll begin with our import statements:



In [None]:
! pip install accelerate
! pip install transformers


In [None]:
import torch
import torch.nn as nn
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, Dataset

from sklearn.model_selection import train_test_split  # used for the splits below
from tqdm.notebook import tqdm

import transformers
from transformers import AutoTokenizer, TrainingArguments, Trainer, BertModel, pipeline, BertForSequenceClassification

# to avoid warnings
import warnings
warnings.filterwarnings('ignore')


## Train, Test, Val split

In [None]:
#new df with binary data
train_b = train[['Toxic', 'comment_text']]

In [None]:
#Train test split
X_train_B, X_test_BERT = train_test_split(train_b, test_size=0.15, random_state=42)


In [None]:
# Validation split. The labels provided for Kaggle's test data could not be used for validation; per Kaggle, a "value of -1 indicates it was not used for scoring; (Note: file added after competition close!)"
X_train_BERT, X_val_BERT= train_test_split(X_train_B, test_size=0.05, random_state=42)

In [None]:
X_train_BERT.reset_index(drop=True, inplace=True)
X_val_BERT.reset_index(drop=True, inplace=True)
X_test_BERT.reset_index(drop=True, inplace=True)
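Because the validation set is carved out of what remains after the test split, its share of the full dataset is smaller than 5%. The short calculation below works out the effective proportions implied by the two `test_size` values used above (exact row counts also depend on how `train_test_split` rounds).

```python
test_fraction = 0.15              # first split: 15% of all rows go to test
val_fraction_of_remainder = 0.05  # second split: 5% of the remaining rows go to validation

remaining = 1.0 - test_fraction
val_fraction = remaining * val_fraction_of_remainder
train_fraction = remaining - val_fraction

print(f"train: {train_fraction:.4f}")  # train: 0.8075
print(f"val:   {val_fraction:.4f}")    # val:   0.0425
print(f"test:  {test_fraction:.4f}")   # test:  0.1500
```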


## Check GPU
To run BERT, we will need the increased processing power of a GPU. We can check whether a GPU is available by running the following code:

In [None]:
import tensorflow as tf  # needed for tf.test.gpu_device_name()

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


## Tokenize Text and Convert to Tensors


In [None]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [None]:
MAX_LEN = 125
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 1

In [None]:
class CustomDataset(Dataset):
    """Tokenizes one comment at a time and returns its tensors and target."""

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = dataframe.comment_text
        self.targets = dataframe.Toxic
        self.max_len = max_len  # use the argument rather than the global MAX_LEN

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, index):
        # Collapse runs of whitespace so the tokenizer sees clean text
        comment_text = str(self.comment_text[index])
        comment_text = " ".join(comment_text.split())

        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True
        )

        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        # Note: this returns a plain list, not a dict keyed by field name
        return [
            torch.tensor(input_ids, dtype=torch.long),
            torch.tensor(attention_mask, dtype=torch.long),
            torch.tensor(token_type_ids, dtype=torch.long),
            torch.tensor(self.targets[index], dtype=torch.float)
        ]

In [None]:
training_set = CustomDataset(X_train_BERT, tokenizer, MAX_LEN)
testing_set = CustomDataset(X_test_BERT, tokenizer, MAX_LEN)
val_set = CustomDataset(X_val_BERT, tokenizer, MAX_LEN)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }
val_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
val_loader = DataLoader(val_set, **val_params)

In [None]:
# Custom classifier: pretrained BERT, then dropout, then a linear layer
# that outputs two raw logits (no sigmoid is applied in this class)

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = BertModel.from_pretrained('bert-base-uncased')
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 2)  # 768 = hidden size of bert-base

    def forward(self, ids, mask, token_type_ids):
        # With return_dict=False, BertModel returns (sequence_output, pooled_output)
        _, output_1 = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

model = BERTClass()
model.to(device)

BERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [None]:

training_args = TrainingArguments(
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate= 2e-5,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model= "precision",
    output_dir="./",
)

In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_set,
    tokenizer=tokenizer,
    eval_dataset=val_set
)

In [None]:
trainer.train()


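AttributeError: ignored

The traceback points at the padding step inside the tokenizer, next to this comment in the library source:

        # The model's main input name, usually `input_ids`, has be passed for padding

A likely cause is that CustomDataset.__getitem__ returns a list of tensors, while the padding step expects each example to be a dictionary that contains an input_ids key.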

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
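For reference, the four metrics returned above can be computed by hand for binary labels. The sketch below uses a hypothetical helper, `binary_metrics`, and made-up labels and predictions to show the underlying arithmetic.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(labels),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

# Made-up example: 2 true positives, 1 false positive, 1 false negative
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(labels, preds)
print(metrics)  # accuracy, precision, recall and F1 are all 2/3 here
```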


Possible causes considered for the training error:

* Trying to compute gradients for a tensor with multiple elements
* Using the wrong loss function
* Using the wrong activation function

## Different approach

In [None]:
# create labels column
cols = ds["train"].column_names
ds = ds.map(lambda x : {"labels": [x[c] for c in cols if c != "comment_text"]})
ds
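To see what the lambda produces, here it is applied to a single row represented as a plain dictionary. The column names below are hypothetical stand-ins for whatever label columns the dataset has.

```python
# One row as a dictionary; the column names are made up for this example
row = {"comment_text": "example comment", "toxic": 1, "obscene": 0, "insult": 1}
cols = list(row.keys())

# Same expression as in the cell above: every column except the text becomes a label
add_labels = lambda x: {"labels": [x[c] for c in cols if c != "comment_text"]}

new_field = add_labels(row)
print(new_field)  # {'labels': [1, 0, 1]}
```

`Dataset.map` merges the returned dictionary into each example, so every row gains a "labels" list whose order follows the original column order.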

## Tokenize and encode

In [None]:
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
def tokenize_and_encode(examples):
  return tokenizer(examples["comment_text"], truncation=True)

In [None]:
cols = ds["train"].column_names
cols.remove("labels")
ds_enc = ds.map(tokenize_and_encode, batched=True, remove_columns=cols)
ds_enc

Loading cached processed dataset at /root/.cache/huggingface/datasets/jigsaw_toxicity_pred/default-2e028684d09fa340/1.1.0/7475ac9e42901c300e7d6f5ff9f1e234a46b3e90c377c1c900da4fd2f7738dbf/cache-13b55004b3b20c8d.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/jigsaw_toxicity_pred/default-2e028684d09fa340/1.1.0/7475ac9e42901c300e7d6f5ff9f1e234a46b3e90c377c1c900da4fd2f7738dbf/cache-1e115761ef12b984.arrow


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels'],
        num_rows: 200
    })
})
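Note that tokenize_and_encode truncates but does not pad, so the encoded examples can have different lengths and padding has to happen when batches are assembled. The sketch below shows the idea of padding each batch to its own longest sequence; `pad_batch` is a hypothetical helper, and the behaviour of the library's own collator is assumed to be similar rather than reproduced here.

```python
def pad_batch(batch, pad_id=0):
    """Pad every example in a batch to the length of the longest one."""
    longest = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for example in batch:
        pad_count = longest - len(example["input_ids"])
        padded["input_ids"].append(example["input_ids"] + [pad_id] * pad_count)
        padded["attention_mask"].append(example["attention_mask"] + [0] * pad_count)
    return padded

# Two encoded examples of different lengths (token ids are arbitrary)
examples = [
    {"input_ids": [1, 7, 8, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [1, 9, 2], "attention_mask": [1, 1, 1]},
]
padded = pad_batch(examples)
print(padded["input_ids"])       # [[1, 7, 8, 2], [1, 9, 2, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```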

In [None]:
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss
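The loss used above treats each label as its own yes/no decision: a sigmoid is applied to every logit, and binary cross-entropy is averaged over all label slots. A plain-Python version of that arithmetic, with made-up logits and targets and a hypothetical helper name, looks like this:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all label slots, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# One example with three independent labels (values are made up)
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 3))  # 0.378
```

This naive form is fine for small logits; `torch.nn.BCEWithLogitsLoss` computes the same quantity in a numerically more stable way.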

In [None]:
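from transformers import AutoModelForSequenceClassification  # missing from the imports above

# num_labels must equal the length of each "labels" vector; it is assumed to be defined earlier
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels).to(device)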

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [None]:
multi_trainer = MultilabelTrainer(
    model,
    training_args,
    train_dataset=ds_enc["train"],
    eval_dataset=ds_enc["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

In [None]:
multi_trainer.train()