# PyTorch BERT Model Training

## Getting Started

**szebr**: I wanted to get this started so that we can work on it over the weekend. I don't know which local environment everyone is using, but I'm a big fan of VS Code with the Jupyter extension. I also recommend checking out the [notebook](https://github.com/marcellusruben/medium-resources/blob/main/Text_Classification_BERT/bert_medium.ipynb) that accompanies the Medium article and getting familiar with PyTorch before adding to this (as I am currently doing). We can remove or edit this section as we refine the codebase.

## Setup

In [9]:
!pip3 install transformers
!pip3 install pandas
!pip3 install torch

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.


In [10]:
import pandas as pd
import torch
import numpy as np
from transformers import BertTokenizer, BertModel
from torch import nn
from torch.optim import Adam
from tqdm import tqdm

In [11]:
datapath = "bbc-text.csv"
df = pd.read_csv(datapath)
df

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| ... | ... | ... |
| 2220 | business | cars pull down us retail figures us retail sal... |
| 2221 | politics | kilroy unveils immigration policy ex-chatshow ... |
| 2222 | entertainment | rem announce new glasgow concert us band rem h... |
| 2223 | politics | how political squabbles snowball it s become c... |


## Text Processing

Following the tutorial provided by CGI, we first create a `Dataset` class that tokenizes the `text` column of our CSV and maps each `category` to an integer label. Our DataFrame is then passed into this class.

In [12]:
# using imported pre-trained tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

### Dataset class

I simplified this slightly from the tutorial to keep it readable.

In [13]:
# creating labels for use in Dataset class

categories = {'business': 0, 'entertainment': 1, 'sport': 2, 'tech': 3, 'politics': 4}

class Dataset(torch.utils.data.Dataset):
    
    # on init, map 'category' to id number in labels[]
    def __init__(self, df):
        self.labels = [categories[label] for label in df['category']]
        self.texts  = [tokenizer(text, padding = 'max_length', max_length = 512, truncation = True,
                                 return_tensors = "pt") for text in df['text']]
    
    def __len__(self):
        return len(self.labels)
    
    # return items for training
    def __getitem__(self, idx):
        
        text   = self.texts[idx]
        label  = torch.tensor(self.labels[idx], dtype=torch.long)
        
        return text, label
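
To make the `padding = 'max_length'`, `max_length = 512`, and `truncation = True` arguments above more concrete, here is a minimal pure-Python sketch of what they do to a sequence of token ids. `pad_or_truncate` is a hypothetical helper written for illustration and is not part of `transformers`. The token ids are made up, the pad id of 0 is an assumption, and the sketch ignores the special tokens that the real tokenizer handles.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: fit a list of token ids to exactly max_length."""
    # Truncation: drop everything past max_length.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    # Padding: append pad_id (assumed to be 0) until the length is max_length.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Made-up token ids, with a small max_length so the effect is easy to see.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)

print(short_ids, short_mask)  # padded on the right, mask ends in zeros
print(long_ids, long_mask)    # cut to 5 ids, mask is all ones
```

Because every text ends up with the same length, the default `DataLoader` collation can stack the items into a batch. With `return_tensors = "pt"` each tokenized text has shape `(1, 512)`, which is why the training loop later calls `.squeeze(1)` on `input_ids`.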
    
    

## Model Building

In [14]:
# This code is taken from the notebook referenced above, for testing purposes.

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):

        super().__init__()

        # Pre-trained BERT encoder; its pooled output has 768 features.
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        # One output per entry in `categories`.
        self.linear = nn.Linear(768, 5)
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):

        # With return_dict=False, BertModel returns a tuple whose first two
        # items are the sequence output and the pooled output; only the
        # pooled output is used here.
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)

        return final_layer
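
The classifier head above ends in a ReLU, and the training code treats its output as the five class scores for both `nn.CrossEntropyLoss` (which applies log-softmax internally) and `argmax`. The pure-Python sketch below illustrates what that means for a single example. The scores are made up, and `relu` and `cross_entropy` are hypothetical helpers written for illustration, not the PyTorch implementations.

```python
import math


def relu(xs):
    """Clamp negative values to zero, as nn.ReLU does element-wise."""
    return [max(0.0, x) for x in xs]


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, in log-sum-exp form."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Made-up linear-layer outputs, one per category.
raw = [2.0, -1.0, 0.5, -3.0, 0.0]
activated = relu(raw)

# Predicted class: index of the largest score (what argmax(dim=1) does per row).
predicted = max(range(len(activated)), key=activated.__getitem__)

loss = cross_entropy(activated, target=0)
loss_without_relu = cross_entropy(raw, target=0)
print(activated, predicted, round(loss, 3), round(loss_without_relu, 3))
```

In this example the ReLU raises every negative score to zero, so the scores of the wrong classes cannot drop below zero and the loss for the correct class is higher than it would be on the raw scores. Whether to keep the final ReLU is a design choice inherited from the tutorial.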

## Model Training

In [8]:
# Adapted from the notebook referenced above, for testing purposes.

# df.sample below uses its own random_state, so this seed does not affect the split.
np.random.seed(303)

# Shuffle once with a fixed seed, then split 80/10/10 into train/val/test.
# Slicing with .iloc gives the same three pieces as np.split(df, [a, b]) and
# avoids the deprecation warning np.split triggers on a DataFrame in recent pandas.
df_shuffled = df.sample(frac=1, random_state=42)
n_train, n_val = int(.8 * len(df)), int(.9 * len(df))
df_train = df_shuffled.iloc[:n_train]
df_val = df_shuffled.iloc[n_train:n_val]
df_test = df_shuffled.iloc[n_val:]

def train(model, train_data, val_data, learning_rate, epochs):

    # Tokenize both splits up front (named *_dataset to avoid shadowing train()).
    train_dataset, val_dataset = Dataset(train_data), Dataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()

    for epoch_num in range(epochs):

        # Training mode: dropout is active.
        model.train()

        total_acc_train = 0
        total_loss_train = 0

        for train_input, train_label in tqdm(train_dataloader):

            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            # Each tokenized text is (1, 512), so a batch is (batch, 1, 512);
            # squeeze(1) drops the extra dimension.
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            batch_loss = criterion(output, train_label.long())
            total_loss_train += batch_loss.item()

            acc = (output.argmax(dim=1) == train_label).sum().item()
            total_acc_train += acc

            model.zero_grad()
            batch_loss.backward()
            optimizer.step()

        # Evaluation mode: dropout is disabled during validation.
        model.eval()

        total_acc_val = 0
        total_loss_val = 0

        with torch.no_grad():

            for val_input, val_label in val_dataloader:

                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)

                batch_loss = criterion(output, val_label.long())
                total_loss_val += batch_loss.item()

                acc = (output.argmax(dim=1) == val_label).sum().item()
                total_acc_val += acc

        # Note: the loss totals sum per-batch mean losses but are divided by the
        # number of samples, so with batch_size=2 they are about half the mean loss.
        print(
            f'Epochs: {epoch_num + 1} '
            f'| Train Loss: {total_loss_train / len(train_data): .3f} '
            f'| Train Accuracy: {total_acc_train / len(train_data): .3f} '
            f'| Val Loss: {total_loss_val / len(val_data): .3f} '
            f'| Val Accuracy: {total_acc_val / len(val_data): .3f}')
                  
EPOCHS = 5
model = BertClassifier()
LR = 1e-6
              
train(model, df_train, df_val, LR, EPOCHS)

100%|██████████| 890/890 [14:17<00:00,  1.04it/s]

Epochs: 1 | Train Loss:  0.800 | Train Accuracy:  0.246 | Val Loss:  0.762 | Val Accuracy:  0.342

100%|██████████| 890/890 [44:04<00:00,  2.97s/it]

Epochs: 2 | Train Loss:  0.562 | Train Accuracy:  0.645 | Val Loss:  0.311 | Val Accuracy:  0.946

100%|██████████| 890/890 [15:07<00:00,  1.02s/it]

Epochs: 3 | Train Loss:  0.205 | Train Accuracy:  0.959 | Val Loss:  0.125 | Val Accuracy:  0.982

 72%|███████▏  | 642/890 [10:59<04:14,  1.03s/it]

KeyboardInterrupt: 
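
A note on reading the numbers printed above: `train()` adds up `batch_loss.item()`, which `nn.CrossEntropyLoss` returns as a mean over the batch by default, and then divides the total by `len(train_data)`, the number of samples. The sketch below mirrors that bookkeeping with made-up numbers. `epoch_metrics` is a hypothetical helper written for illustration and is not part of the notebook.

```python
def epoch_metrics(batch_losses, batch_correct, n_samples):
    """Mirror the bookkeeping in train(): sum per-batch values, divide by sample count."""
    reported_loss = sum(batch_losses) / n_samples
    accuracy = sum(batch_correct) / n_samples
    return reported_loss, accuracy


# Made-up numbers: 3 batches of 2 samples each.
batch_losses = [1.6, 1.2, 0.8]  # mean loss of each batch
batch_correct = [1, 2, 2]       # correct predictions in each batch

reported_loss, accuracy = epoch_metrics(batch_losses, batch_correct, n_samples=6)
mean_batch_loss = sum(batch_losses) / len(batch_losses)

print(f'reported loss: {reported_loss:.3f}')
print(f'mean per-batch loss: {mean_batch_loss:.3f}')
print(f'accuracy: {accuracy:.3f}')
```

With `batch_size=2`, the printed loss is therefore roughly half the mean per-batch loss, while the accuracy is a true per-sample fraction. This is consistent with the first epoch's train loss of 0.800: a five-class model that guesses uniformly has a cross-entropy of ln 5 ≈ 1.609, and half of that is about 0.805.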

## Model Evaluation

In [1]:
# Adapted from the notebook referenced above, for testing purposes.

def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    # Evaluation mode: dropout is disabled so predictions are deterministic.
    model.eval()

    total_acc_test = 0
    with torch.no_grad():

        for test_input, test_label in test_dataloader:

            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            acc = (output.argmax(dim=1) == test_label).sum().item()
            total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

# `model` and `df_test` must already exist in this kernel session, so the
# training cell above has to be run first; in a fresh kernel this raises NameError.
evaluate(model, df_test)

NameError: name 'model' is not defined