<a href="https://colab.research.google.com/github/robertjprior/DeepLearning_CodeTemplate_MLOps/blob/main/Review_Sentiment_Classification_DistilBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install datasets


In [2]:
from tqdm import tqdm

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [4]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

No subsampling is applied in this notebook: the full file is used, and the train, validate, and test splits created further down total 6,920 sentences.

## Loading the Pre-trained BERT model
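Let's now set the hyperparameters and load a pre-trained DistilBERT model. A commented-out line in the loading cell switches to full BERT instead.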

In [229]:
TEST_SIZE = 0.1
batch_size = 20
DROPOUT = 0.5
NUM_LABELS = 2
learning_rate = 1e-3
num_train_epochs = 20

In [230]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'


In [231]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights, return_dict=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [232]:
import torch
from torch import nn

class DistilBertClassifier(nn.Module):
    """Pretrained encoder followed by a small feed-forward classification head."""

    def __init__(self, pretrained_model, num_labels=NUM_LABELS, dropout=DROPOUT, averaging="last four"):
        super().__init__()
        self.num_labels = num_labels
        self.averaging = averaging

        self.dropout = nn.Dropout(dropout)
        self.bert = pretrained_model  # encoder loaded with return_dict=True
        self.hidden_size = self.bert.config.hidden_size
        if self.averaging == "last four":
            # The last four hidden layers are concatenated, so the head input is 4x wider.
            self.hidden_size = self.hidden_size * 4

        # Extra dense layer in front of the classifier, see:
        # https://github.com/google-research/bert/issues/43
        # https://discuss.huggingface.co/t/what-is-the-purpose-of-the-additional-dense-layer-in-classification-heads/526
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear = nn.Linear(self.hidden_size, self.num_labels)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)  # not used in forward(); the loss works on raw logits


    def forward(self, input_ids, attention_mask):

        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True, return_dict=True)
        if self.averaging=="last":
            # Mean-pool the last layer over the sequence dimension (dim 1).
            #TODO: should ultimately try to avoid padding tokens https://stackoverflow.com/questions/71434804/how-to-fed-last-4-concatenated-hidden-layers-of-bert-to-fc-layers
            sentence_representation = torch.mean(outputs['last_hidden_state'], 1)
        elif self.averaging == "last four":
            feature_layers = outputs['hidden_states'][-4:]
            # Concatenate over the last dimension: (batch_size, seq_len, 4 * hidden_size)
            sentence_representation = torch.cat(feature_layers, -1)
            # Keep only the first ([CLS]) position instead of averaging, so padding
            # never enters the representation: (batch_size, 4 * hidden_size), e.g. [20, 3072]
            sentence_representation = sentence_representation[:,0,:]
        else: # any other value: [CLS] token of the last layer
            sentence_representation = outputs['last_hidden_state'][:, 0, :]
        x = self.dropout(sentence_representation)
        x = self.dense(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear(x)
        #x = self.softmax(x)
        return x
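
The TODO in the `"last"` branch notes that a plain mean also averages over padding positions. One possible way to address it is to weight the mean by the attention mask. The sketch below is an illustration only: `masked_mean_pool` is a hypothetical helper that is not part of the notebook, and the tensors are made-up toy values.

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over the sequence, ignoring padded positions.

    hidden_states:  (batch_size, seq_len, hidden_size)
    attention_mask: (batch_size, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # padded positions add zero
    counts = mask.sum(dim=1).clamp(min=1.0)                      # guard against empty rows
    return summed / counts

# Toy batch: two sequences of length 3 with hidden size 2.
# The last position of the second sequence is padding and holds a large value.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                       [[2.0, 2.0], [4.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(dim=1)
print(pooled)  # second row averages only the two real tokens
print(plain)   # second row is pulled up by the padded vector
```

In the classifier, `outputs['last_hidden_state']` and the `attention_mask` argument of `forward` would play the roles of `hidden` and `mask`.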

In [233]:
model = DistilBertClassifier(pretrained_model = model, num_labels = NUM_LABELS, dropout = DROPOUT, averaging = "last four")

In [234]:
model = model.to(device)

In [235]:
#validate everything is on the device
#print(device)
#for param in model.parameters():
#    print(type(param), param.size(), param.device)


## Data -> Dataloader #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires. The steps are:

1. Split the data into train, validate, and test sets.
2. Load each split into a `Dataset` so it can be stored on disk until its batch is called.
3. Tokenize the text.
4. Add the label to the dataset in the correct format.
5. Create a data collator that pads only when a batch is called, which saves RAM.
6. Set up a `DataLoader` that runs the collator function whenever a batch is called.
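
To make the padding-at-batch-time idea concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually. It is not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the token ids are invented, and `pad_id=0` is an assumption for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the longest length in that batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; real ids would come from the tokenizer.
tokenized = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]

demo_batch_size = 2
batches = [
    pad_batch(tokenized[i:i + demo_batch_size])
    for i in range(0, len(tokenized), demo_batch_size)
]

# Each batch is only as wide as its own longest sequence.
widths = [len(batch["input_ids"][0]) for batch in batches]
padded_slots = sum(w * demo_batch_size for w in widths)
global_slots = max(len(seq) for seq in tokenized) * len(tokenized)
print(widths, padded_slots, global_slots)
```

Padding per batch stores 26 token slots here, against 32 if every sequence were padded up front to the longest sequence in the dataset.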



In [236]:
df.columns = ['text', 'label']

In [237]:
def create_tokenized_datasets(tokenizer, datafile_name, label_col_name, text_col_name):
    """Return a Hugging Face DatasetDict with train, validate, and test splits.

    Each split starts with a text column and a label column; tokenization adds
    the tokenizer outputs plus a 'labels' column. No tensors are created here;
    the data collator builds them when a batch is drawn.
    """


    #load dataset class object
    df, labels = pytorch_dataset(datafile_name, label_col_name)
    #transform dataset label

    def tokenize_function(example):
        #old handling: tokenized_outputs = tokenizer(text, return_tensors="pt")
        tokens = tokenizer(example[text_col_name], truncation=True, padding=False)
        tokens['labels'] = labels.str2int(example[label_col_name])
        return tokens

    #tokenize dataset (doing it this way so the results get pushed back as new columns in Datasets format stored on Disk instead of returning dictionary stored in RAM)
    tokenized_datasets = df.map(tokenize_function, batched=True)
    return tokenized_datasets



def pytorch_dataset(filename, label_col_name):
    from datasets import Dataset, DatasetDict, ClassLabel
    train, validate, test, labels_set = optimization_read_split_data(
        df = filename,
        test_size=TEST_SIZE,
        label_col_name=label_col_name,
    )
    train = Dataset.from_pandas(train)
    validate = Dataset.from_pandas(validate)
    test = Dataset.from_pandas(test)
    dataset = DatasetDict({
        "train": train,
        "validate": validate,
        "test": test})
    labels = ClassLabel(names = list(labels_set))
    return dataset, labels

def optimization_read_split_data(df, test_size, label_col_name):
    """Split a dataframe into stratified train, validate, and test frames.

    test_size is applied twice: first to carve out the test set, then to
    carve the validation set out of what remains.
    """
    train, test = train_test_split(df, test_size=test_size, stratify=df[label_col_name])
    train, validate = train_test_split(train, test_size=test_size, stratify=train[label_col_name])
    labels = set(df[label_col_name])
    return train, validate, test, labels


def calcuate_accuracy(preds, targets):
    """Return the number of correct predictions (a count, not a ratio)."""
    n_correct = (preds==targets).sum().item()
    return n_correct

In [238]:
tokenized_df = create_tokenized_datasets(tokenizer, df, 'label', 'text')
tokenized_df = tokenized_df.remove_columns(["text", "label", "__index_level_0__"])

Map:   0%|          | 0/5605 [00:00<?, ? examples/s]

Map:   0%|          | 0/623 [00:00<?, ? examples/s]

Map:   0%|          | 0/692 [00:00<?, ? examples/s]
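
The three progress bars report 5605, 623, and 692 examples. Those sizes follow from applying `TEST_SIZE = 0.1` twice to the 6,920 rows. The arithmetic can be sketched as below, assuming each held-out fraction is rounded up, which is how scikit-learn is understood to size a fractional test set. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes obtained by holding out `test_size` twice, rounding each hold-out up."""
    n_test = math.ceil(n_samples * test_size)
    n_remaining = n_samples - n_test
    n_validate = math.ceil(n_remaining * test_size)
    n_train = n_remaining - n_validate
    return n_train, n_validate, n_test

n_train, n_validate, n_test = split_sizes(6920, 0.1)
print(n_train, n_validate, n_test)  # 5605 623 692

# Share of the full dataset that lands in each split.
shares = [round(n / 6920, 2) for n in (n_train, n_validate, n_test)]
print(shares)  # [0.81, 0.09, 0.1]
```

The effective split is therefore about 81% train, 9% validate, and 10% test, rather than 80/10/10.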

In [239]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True)

train_loader = DataLoader(tokenized_df['train'], collate_fn=data_collator, batch_size=batch_size)
validate_loader = DataLoader(tokenized_df['validate'], collate_fn=data_collator, batch_size=batch_size)
test_loader = DataLoader(tokenized_df['test'], collate_fn=data_collator, batch_size=batch_size)

#show collator working
#example = tokenizer(df['text'][1], truncation=True, padding = False)
#example2 = data_collator(example)
#example2['input_ids'].shape
#print(len(example['input_ids']))
#print(example2['input_ids'].shape)

In [240]:
#to view a batch and validate padding
#next(iter(train_loader))

#tokenizer.decode([0])

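Example walkthrough of the base BERT model. The commented-out cell below expects `model` to be the bare encoder, so it would need to run before `model` is wrapped in `DistilBertClassifier`.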

In [241]:
# batch = next(iter(train_loader))
# with torch.no_grad():
#     output = model(batch['input_ids'], batch['attention_mask'], output_hidden_states=True)

# output.last_hidden_state.size() #(batch_size, sequence_length, hidden_size)
# print(len(output.hidden_states))
# print(output.hidden_states[0].size())

# feature_layers = output['hidden_states'][-4:]
# sentence_representation = torch.cat(feature_layers, -1) #concatenate them (here over the last dimension) to a single tensor of shape (batch_size, seq_len, 4 * hidden_size)
# #sentence_representation = torch.mean(sentence_representation, 1)

# #alternative that avoids taking the mean of paddings in there
# sentence_representation = sentence_representation[:,0,:]
# #sentence_representation.size() #torch.Size([20, 3072])

In [242]:
# #and the finetuning bert model
# batch = next(iter(train_loader))
# with torch.no_grad():
#     output = model.forward(batch['input_ids'], batch['attention_mask'])
# output


In [243]:
loss_function = torch.nn.CrossEntropyLoss()
#optimizer = torch.optim.Adam(params = model.parameters(), lr=learning_rate)
optimizer = torch.optim.SGD(params = model.parameters(), lr=learning_rate)



trial = None
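
`torch.nn.CrossEntropyLoss` works on raw logits, which is why the softmax call in `forward` is left commented out. The sketch below, with made-up logits and targets, shows that the loss equals log-softmax followed by negative log-likelihood, and that applying softmax beforehand changes the value.

```python
import torch
import torch.nn.functional as F

# Made-up logits for a batch of two examples and two classes.
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
targets = torch.tensor([0, 1])

criterion = torch.nn.CrossEntropyLoss()

# Loss computed directly from logits, as in the training loop.
loss_from_logits = criterion(logits, targets)

# The same quantity written out: log-softmax, then negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Feeding probabilities instead of logits applies softmax twice in effect.
loss_double_softmax = criterion(torch.softmax(logits, dim=1), targets)

print(loss_from_logits.item(), loss_manual.item(), loss_double_softmax.item())
```

The first two values agree; the third is larger for this input because the already-normalized probabilities are squashed a second time.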

In [244]:
#Need to enable my custom function to go to cuda

In [245]:
for epoch in range(num_train_epochs):
    #print(f"Epoch #: {epoch}")
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for batch_idx, data in enumerate(tqdm(train_loader)):
        optimizer.zero_grad()

        ids = data['input_ids'].to(device, dtype = torch.long)
        mask = data['attention_mask'].to(device, dtype = torch.long)
        #token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['labels'].to(device, dtype = torch.long)
        #print(model.is_cuda)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)

        if (batch_idx%1000==0) and (batch_idx != 0):
            print(batch_idx)
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples
            print(f"Training Loss per 1000 steps: {loss_step}")
            print(f"Training Accuracy per 1000 steps: {accu_step}")


        loss.backward()
        optimizer.step()

    # Validation of the model.
    model.eval()
    correct = 0
    eval_nb_tr_examples = 0
    val_loss = 0

    with torch.no_grad():
        for batch_idx, data in enumerate(tqdm(validate_loader)):
            ids = data['input_ids'].to(device, dtype = torch.long)
            mask = data['attention_mask'].to(device, dtype = torch.long)
            #token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['labels'].to(device, dtype = torch.long)
            outputs = model(ids, mask)
            loss = loss_function(outputs, targets)
            val_loss += loss.item()
            # Index of the largest logit, i.e. the predicted class.
            pred = outputs.argmax(dim=1, keepdim=True)
            correct += pred.eq(targets.view_as(pred)).sum().item()
            eval_nb_tr_examples+=targets.size(0)
    #if not epoch%10:
    accuracy = correct / eval_nb_tr_examples

    print(
        f"Epoch: {epoch:02d} | "
        f"train_loss: {tr_loss:.5f}, "
        f"train_accuracy: {(n_correct*100)/nb_tr_examples:.5f}, "
        f"val_loss: {val_loss:.5f}, "
        f"val accuracy: {accuracy:.5f}")



100%|██████████| 281/281 [00:20<00:00, 13.71it/s]
100%|██████████| 32/32 [00:00<00:00, 35.75it/s]


Epoch: 00 | train_loss: 192.15125, train_accuracy: 55.28992, val_loss: 20.76897, val accuracy: 0.71589


100%|██████████| 281/281 [00:21<00:00, 12.86it/s]
100%|██████████| 32/32 [00:00<00:00, 32.44it/s]


Epoch: 01 | train_loss: 165.83324, train_accuracy: 72.07850, val_loss: 13.97613, val accuracy: 0.83628


100%|██████████| 281/281 [00:23<00:00, 12.16it/s]
100%|██████████| 32/32 [00:00<00:00, 33.03it/s]


Epoch: 02 | train_loss: 118.04937, train_accuracy: 81.74844, val_loss: 10.21156, val accuracy: 0.86356


100%|██████████| 281/281 [00:21<00:00, 12.81it/s]
100%|██████████| 32/32 [00:00<00:00, 34.73it/s]


Epoch: 03 | train_loss: 98.69681, train_accuracy: 85.42373, val_loss: 9.39156, val accuracy: 0.87159


100%|██████████| 281/281 [00:21<00:00, 13.01it/s]
100%|██████████| 32/32 [00:00<00:00, 34.13it/s]


Epoch: 04 | train_loss: 86.59739, train_accuracy: 87.36842, val_loss: 9.12036, val accuracy: 0.88122


100%|██████████| 281/281 [00:22<00:00, 12.77it/s]
100%|██████████| 32/32 [00:00<00:00, 33.35it/s]


Epoch: 05 | train_loss: 79.06107, train_accuracy: 88.63515, val_loss: 9.03436, val accuracy: 0.88604


100%|██████████| 281/281 [00:22<00:00, 12.66it/s]
100%|██████████| 32/32 [00:00<00:00, 33.32it/s]


Epoch: 06 | train_loss: 70.96549, train_accuracy: 90.24086, val_loss: 8.92547, val accuracy: 0.89246


100%|██████████| 281/281 [00:21<00:00, 12.80it/s]
100%|██████████| 32/32 [00:00<00:00, 33.62it/s]


Epoch: 07 | train_loss: 66.16977, train_accuracy: 91.11508, val_loss: 8.85084, val accuracy: 0.89085


100%|██████████| 281/281 [00:21<00:00, 12.84it/s]
100%|██████████| 32/32 [00:00<00:00, 33.75it/s]


Epoch: 08 | train_loss: 59.61952, train_accuracy: 92.43533, val_loss: 8.81275, val accuracy: 0.89246


100%|██████████| 281/281 [00:21<00:00, 12.81it/s]
100%|██████████| 32/32 [00:00<00:00, 34.18it/s]


Epoch: 09 | train_loss: 52.45029, train_accuracy: 93.34523, val_loss: 9.54955, val accuracy: 0.88764


100%|██████████| 281/281 [00:21<00:00, 12.79it/s]
100%|██████████| 32/32 [00:00<00:00, 33.99it/s]


Epoch: 10 | train_loss: 48.48669, train_accuracy: 93.45227, val_loss: 9.75816, val accuracy: 0.88925


100%|██████████| 281/281 [00:22<00:00, 12.75it/s]
100%|██████████| 32/32 [00:00<00:00, 33.69it/s]


Epoch: 11 | train_loss: 43.24334, train_accuracy: 94.77252, val_loss: 9.89108, val accuracy: 0.88283


100%|██████████| 281/281 [00:22<00:00, 12.73it/s]
100%|██████████| 32/32 [00:01<00:00, 31.89it/s]


Epoch: 12 | train_loss: 38.83653, train_accuracy: 95.32560, val_loss: 10.64778, val accuracy: 0.88764


100%|██████████| 281/281 [00:22<00:00, 12.63it/s]
100%|██████████| 32/32 [00:00<00:00, 33.92it/s]


Epoch: 13 | train_loss: 34.36444, train_accuracy: 95.68243, val_loss: 10.84388, val accuracy: 0.88443


100%|██████████| 281/281 [00:22<00:00, 12.71it/s]
100%|██████████| 32/32 [00:00<00:00, 33.80it/s]


Epoch: 14 | train_loss: 31.11783, train_accuracy: 96.32471, val_loss: 10.73536, val accuracy: 0.88925


100%|██████████| 281/281 [00:22<00:00, 12.73it/s]
100%|██████████| 32/32 [00:00<00:00, 33.69it/s]


Epoch: 15 | train_loss: 25.60192, train_accuracy: 97.14541, val_loss: 11.74577, val accuracy: 0.88443


100%|██████████| 281/281 [00:22<00:00, 12.72it/s]
100%|██████████| 32/32 [00:00<00:00, 33.59it/s]


Epoch: 16 | train_loss: 22.64109, train_accuracy: 97.53791, val_loss: 12.27477, val accuracy: 0.88604


100%|██████████| 281/281 [00:22<00:00, 12.55it/s]
100%|██████████| 32/32 [00:00<00:00, 33.40it/s]


Epoch: 17 | train_loss: 21.38863, train_accuracy: 97.37734, val_loss: 12.16292, val accuracy: 0.88925


100%|██████████| 281/281 [00:22<00:00, 12.71it/s]
100%|██████████| 32/32 [00:00<00:00, 33.93it/s]


Epoch: 18 | train_loss: 18.32141, train_accuracy: 97.96610, val_loss: 12.97807, val accuracy: 0.88443


100%|██████████| 281/281 [00:22<00:00, 12.74it/s]
100%|██████████| 32/32 [00:00<00:00, 33.72it/s]

Epoch: 19 | train_loss: 15.33448, train_accuracy: 98.32293, val_loss: 13.48455, val accuracy: 0.88443
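
The printed `train_loss` and `val_loss` are sums over batches (281 training and 32 validation batches per epoch, per the progress bars), and `train_accuracy` is a percentage while `val accuracy` is a fraction. The sketch below reads the log under those assumptions: it converts the validation sums to per-batch means, finds the epoch with the lowest validation loss, and shows where a simple patience rule would have stopped. The `early_stop_epoch` helper and the patience of 3 are hypothetical choices for illustration, not something the notebook implements.

```python
# val_loss values copied from the epoch log, one per epoch (00 to 19).
val_loss_sums = [
    20.76897, 13.97613, 10.21156, 9.39156, 9.12036,
    9.03436, 8.92547, 8.85084, 8.81275, 9.54955,
    9.75816, 9.89108, 10.64778, 10.84388, 10.73536,
    11.74577, 12.27477, 12.16292, 12.97807, 13.48455,
]
num_val_batches = 32  # from the validation progress bar

# Per-batch mean loss, comparable across loaders of different length.
mean_val_loss = [total / num_val_batches for total in val_loss_sums]

# Epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss_sums)), key=val_loss_sums.__getitem__)

def early_stop_epoch(losses, patience):
    """Return the first epoch at which `patience` epochs passed without improvement."""
    best_loss, best_at = float("inf"), -1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_at = loss, epoch
        elif epoch - best_at >= patience:
            return epoch
    return None

stop_epoch = early_stop_epoch(val_loss_sums, patience=3)
print(best_epoch, round(mean_val_loss[best_epoch], 4), stop_epoch)
```

Validation loss bottoms out at epoch 08 and rises afterwards while training loss keeps falling, which is the usual sign of overfitting. With a patience of 3, training would have stopped at epoch 11.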



