# What do we have here?

This notebook uses the [Hugging Face Transformers](https://huggingface.co/transformers/) library to create a BERT model for classification. This is just a prototype, so we will use a small dataset.

Note: My computer is not very powerful, so I trained the model on Google Colab. You can use Colab too, but you need to upload the dataset to your Google Drive and change the dataset path accordingly.

### Installing and importing libraries

In [3]:
!pip install pytorch-transformers==1.2.0 torch==1.10.1 torchaudio==0.10.1 torchtext==0.11.1 torchvision==0.11.2 transformers==4.13.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.13.0
  Using cached transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.10.3 transformers-4.13.0


In [4]:
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.data import Field, TabularDataset, BucketIterator, Iterator
from transformers import BertTokenizer, BertForSequenceClassification


In [5]:
torch.cuda.empty_cache()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [20]:
from google.colab import drive
drive.mount('drive')

source_folder = 'drive/MyDrive/Colab-Notebooks/prueba-bert'
destination_folder = 'drive/MyDrive/Colab-Notebooks/prueba-bert'

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [7]:
# Tokenizer for the multilingual, cased BERT checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')


In [9]:
from google.colab import files
uploaded = files.upload()

Saving test.json to test.json
Saving train.json to train.json
Saving valid.json to valid.json
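
The three uploaded files are later read with `format='json'`. As far as I know, torchtext's legacy reader parses such a file one line at a time, so each line has to be a complete JSON object rather than the whole file being one JSON list. Below is a minimal sketch of that layout using only the standard library. The four key names come from this notebook; the record values and the file name `sample.json` are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records: only the four key names are taken from the notebook.
records = [
    {"label": 1, "titular": "Headline A", "cuerpo": "Body A", "titletext": "Headline A. Body A"},
    {"label": 0, "titular": "Headline B", "cuerpo": "Body B", "titletext": "Headline B. Body B"},
]


def write_json_lines(path, rows):
    """Write one JSON object per line (no enclosing list, no header row)."""
    with open(path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse every non-empty line as its own JSON object."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")  # hypothetical file name
    write_json_lines(sample_path, records)
    loaded = read_json_lines(sample_path)

print(len(loaded), sorted(loaded[0]))
```

`ensure_ascii=False` keeps accented characters readable in the file, which is convenient for Spanish text.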


### Tabular data

In [11]:
# Model parameters
MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

# Fields

label_field = Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False, include_lengths=False, batch_first=True,
                   fix_length=MAX_SEQ_LEN, pad_token=PAD_INDEX, unk_token=UNK_INDEX)
#fields = [('label', label_field), ('titular', text_field), ('cuerpo', text_field), ('titletext', text_field)] # use this if data is csv
fields = {'label': ('label', label_field),
          'titular': ('titular', text_field),
          'cuerpo': ('cuerpo', text_field),
          'titletext': ('titletext', text_field)} # use this if data is json

# TabularDataset

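# Note: skip_header targets CSV/TSV header rows; with format='json' it appears to drop the first record of each file.
train, valid, test = TabularDataset.splits(path="", train="train.json", validation="valid.json",
                                           test='test.json', format='json', fields=fields, skip_header=True)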

In [12]:
# Inspect the first training example
print(vars(train[0]))

{'label': 1, 'titular': [101, 10159, 47982, 10281, 22725, 64427, 49861, 10198, 11639, 74831, 11410, 10125, 10193, 10104, 15761, 102], 'cuerpo': [101, 10159, 47982, 10281, 22725, 64427, 48932, 41931, 28841, 21051, 113, 47271, 11127, 114, 20942, 10547, 99638, 10107, 10125, 11639, 74831, 10104, 10846, 27605, 10171, 11410, 10125, 10193, 10104, 15761, 119, 14563, 14668, 117, 16507, 10153, 81547, 10104, 10218, 53565, 10171, 193, 11055, 106018, 117, 10402, 26883, 10171, 193, 10198, 22238, 14467, 117, 10121, 12923, 10661, 12042, 11990, 16085, 169, 16649, 22782, 119, 10883, 14806, 10104, 16186, 15498, 54284, 28107, 66558, 24541, 10216, 10125, 56708, 10104, 10285, 61006, 10104, 52616, 10107, 10220, 10109, 84688, 22381, 10219, 193, 10109, 10104, 31908, 11473, 10109, 29162, 81873, 117, 10121, 12418, 26219, 10567, 43342, 10107, 10183, 12944, 76194, 193, 10121, 42523, 10493, 86095, 19403, 11205, 10183, 10125, 53648, 119, 11518, 10398, 33972, 10104, 11639, 74831, 16037, 10110, 91191, 11205, 10110, 10

In [13]:
# Number of examples
print(f"Number of training examples: {len(train)}")
print(f"Number of testing examples: {len(test)}")
print(f"Number of validation examples: {len(valid)}")

Number of training examples: 15
Number of testing examples: 9
Number of validation examples: 65
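
To make the effect of `fix_length=MAX_SEQ_LEN` concrete, here is a small pure-Python sketch of what I understand the text field to do when it builds a batch: long sequences are cut to 128 ids and short ones are right-padded with the pad index. The helper name `pad_or_truncate` is hypothetical, and `PAD_INDEX = 0` is an assumption (the notebook reads the real value from the tokenizer).

```python
MAX_SEQ_LEN = 128
PAD_INDEX = 0  # assumption: id of [PAD]; the notebook takes it from the tokenizer


def pad_or_truncate(token_ids, fix_length=MAX_SEQ_LEN, pad_index=PAD_INDEX):
    """Return exactly fix_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:fix_length]
    return clipped + [pad_index] * (fix_length - len(clipped))


# A short sequence keeps its closing 102 ([SEP]) and gains padding.
short = pad_or_truncate([101, 10159, 47982, 102], fix_length=6)

# A long sequence is cut at 128 ids, so its closing 102 is lost.
long_ = pad_or_truncate([101] + [10104] * 200 + [102])

print(short)
print(len(long_), long_[-1])
```

One consequence worth keeping in mind: because `tokenizer.encode` adds the special tokens before the field truncates, any text longer than 128 tokens reaches the model without its final `[SEP]`.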


### Iterators

In [14]:
# Iterators
batch_size = 8

#train_iter = BucketIterator(train, batch_size=batch_size, sort_key=lambda x: len(x.text),
train_iter = BucketIterator(train, batch_size=batch_size, sort_key=lambda x: len(x.cuerpo),
                            device=device, train=True, sort=True, sort_within_batch=True)
#valid_iter = BucketIterator(valid, batch_size=batch_size, sort_key=lambda x: len(x.text),
valid_iter = BucketIterator(valid, batch_size=batch_size, sort_key=lambda x: len(x.cuerpo),
                            device=device, train=True, sort=True, sort_within_batch=True)
test_iter = Iterator(test, batch_size=batch_size, device=device, train=False, shuffle=False, sort=False)
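
The idea behind sorting by `len(x.cuerpo)` is that examples of similar length end up in the same batch. The following is a simplified sketch of that idea, not torchtext's actual implementation: it sorts the example indices by length and cuts the sorted list into chunks of `batch_size`. The lengths are invented; only the number of examples (15) and the batch size (8) match this notebook.

```python
def bucket_batches(lengths, batch_size):
    """Sort example indices by length, then cut the sorted list into batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Hypothetical token counts for 15 examples (the size of the training split).
lengths = [310, 45, 128, 90, 512, 33, 270, 64, 150, 88, 401, 72, 199, 56, 120]
batches = bucket_batches(lengths, batch_size=8)

for batch in batches:
    print(len(batch), [lengths[i] for i in batch])
```

With 15 examples and a batch size of 8 this gives two batches per epoch (8 and 7 examples), which is consistent with the 40 total steps reported for 20 epochs in the training log.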

### Model

In [15]:
class BERT(nn.Module):

    def __init__(self):
        super(BERT, self).__init__()

        options_name = "bert-base-multilingual-cased"
        self.encoder = BertForSequenceClassification.from_pretrained(options_name)

    def forward(self, text, label):
        loss, text_fea = self.encoder(text, labels=label)[:2]

        return loss, text_fea
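
When `labels` are passed as integer class ids and the model has two output labels (the default), the loss returned by `BertForSequenceClassification` should be a cross-entropy over the two logits. The sketch below computes that quantity by hand for a single example, using only the standard library. The function name and the logit values are illustrative and not taken from the notebook.

```python
import math


def cross_entropy(logits, label):
    """-log(softmax(logits)[label]) for one example, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


uninformed = cross_entropy([0.0, 0.0], label=1)  # both classes equally likely: ln(2)
confident = cross_entropy([-2.0, 3.0], label=1)  # correct class strongly preferred
wrong = cross_entropy([-2.0, 3.0], label=0)      # wrong class strongly preferred

print(round(uninformed, 4), round(confident, 4), round(wrong, 4))
```

A classifier that has learned nothing yet sits near ln(2) ≈ 0.693, which is a useful reference point when reading the first validation losses printed during training.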

### Functions for training

In [16]:
# Save and Load Functions

def save_checkpoint(save_path, model, valid_loss):

    if save_path is None:
        return
    
    state_dict = {'model_state_dict': model.state_dict(),
                  'valid_loss': valid_loss}
    
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')

def load_checkpoint(load_path, model):
    
    if load_path is None:
        return
    
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')
    
    model.load_state_dict(state_dict['model_state_dict'])
    return state_dict['valid_loss']


def save_metrics(save_path, train_loss_list, valid_loss_list, global_steps_list):

    if save_path is None:
        return
    
    state_dict = {'train_loss_list': train_loss_list,
                  'valid_loss_list': valid_loss_list,
                  'global_steps_list': global_steps_list}
    
    torch.save(state_dict, save_path)
    print(f'Model saved to ==> {save_path}')


def load_metrics(load_path):

    if load_path is None:
        return
    
    state_dict = torch.load(load_path, map_location=device)
    print(f'Model loaded from <== {load_path}')
    
    return state_dict['train_loss_list'], state_dict['valid_loss_list'], state_dict['global_steps_list']

In [21]:
# Training Function

def train(model,
          optimizer,
          criterion = nn.BCELoss(),  # unused below: the loss is returned by the model's forward pass
          train_loader = train_iter,
          valid_loader = valid_iter,
          #num_epochs = 5,
          num_epochs = 20,
          eval_every = len(train_iter) // 2,
          file_path = destination_folder,
          best_valid_loss = float("Inf")):
    
    # initialize running values
    running_loss = 0.0
    valid_running_loss = 0.0
    global_step = 0
    train_loss_list = []
    valid_loss_list = []
    global_steps_list = []

    # training loop
    model.train()
    for epoch in range(num_epochs):
        #for (labels, title, text, titletext), _ in train_loader:
        for (labels, titular, cuerpo, titletext), _ in train_loader:
            labels = labels.type(torch.LongTensor)           
            labels = labels.to(device)
            titletext = titletext.type(torch.LongTensor)  
            titletext = titletext.to(device)
            output = model(titletext, labels)
            loss, _ = output

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # update running values
            running_loss += loss.item()
            global_step += 1

            # evaluation step
            if global_step % eval_every == 0:
                model.eval()
                with torch.no_grad():                    

                    # validation loop
                    for (labels, titular, cuerpo, titletext), _ in valid_loader:
                        labels = labels.type(torch.LongTensor)           
                        labels = labels.to(device)
                        titletext = titletext.type(torch.LongTensor)  
                        titletext = titletext.to(device)
                        output = model(titletext, labels)
                        loss, _ = output
                        
                        valid_running_loss += loss.item()

                # evaluation
                average_train_loss = running_loss / eval_every
                average_valid_loss = valid_running_loss / len(valid_loader)
                train_loss_list.append(average_train_loss)
                valid_loss_list.append(average_valid_loss)
                global_steps_list.append(global_step)

                # resetting running values
                running_loss = 0.0                
                valid_running_loss = 0.0
                model.train()

                # print progress
                print('Epoch [{}/{}], Step [{}/{}], Train Loss: {:.4f}, Valid Loss: {:.4f}'
                      .format(epoch+1, num_epochs, global_step, num_epochs*len(train_loader),
                              average_train_loss, average_valid_loss))
                
                # checkpoint
                if best_valid_loss > average_valid_loss:
                    best_valid_loss = average_valid_loss
                    save_checkpoint(file_path + '/' + 'model.pt', model, best_valid_loss)
                    save_metrics(file_path + '/' + 'metrics.pt', train_loss_list, valid_loss_list, global_steps_list)
    
    save_metrics(file_path + '/' + 'metrics.pt', train_loss_list, valid_loss_list, global_steps_list)
    print('Finished Training!')
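
The evaluation schedule and the checkpoint rule in `train` can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the helper names `evaluation_steps` and `checkpoint_steps` are hypothetical, and the explicit `ValueError` is my addition (in the function above, a single batch per epoch would make `eval_every` zero and the modulo would raise `ZeroDivisionError`).

```python
import math


def evaluation_steps(n_examples, batch_size, num_epochs):
    """Return (eval_every, total_steps, steps at which validation runs)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    eval_every = steps_per_epoch // 2
    if eval_every == 0:
        # Hypothetical guard: the notebook itself does not check this case.
        raise ValueError("one batch per epoch gives eval_every == 0")
    total_steps = num_epochs * steps_per_epoch
    eval_at = [s for s in range(1, total_steps + 1) if s % eval_every == 0]
    return eval_every, total_steps, eval_at


def checkpoint_steps(valid_losses, best=float("inf")):
    """Return the 1-based evaluations that improve on the best loss so far."""
    saved = []
    for step, loss in enumerate(valid_losses, start=1):
        if best > loss:
            best = loss
            saved.append(step)
    return saved, best


# Sizes used in this notebook: 15 training examples, batch size 8, 20 epochs.
eval_every, total_steps, eval_at = evaluation_steps(15, 8, 20)

# A made-up loss curve in which the third evaluation is worse than the second.
saved, best = checkpoint_steps([0.70, 0.68, 0.69, 0.67])

print(eval_every, total_steps, len(eval_at))
print(saved, best)
```

With these sizes validation runs after every training step, so each reported training loss is the loss of a single batch, which may explain why it jumps around from one step to the next.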

In [22]:
model = BERT().to(device)
optimizer = optim.Adam(model.parameters(), lr=2e-5)

train(model=model, optimizer=optimizer)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

Epoch [1/20], Step [1/40], Train Loss: 0.6300, Valid Loss: 0.7089
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/model.pt
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/metrics.pt
Epoch [1/20], Step [2/40], Train Loss: 0.9681, Valid Loss: 0.6946
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/model.pt
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/metrics.pt
Epoch [2/20], Step [3/40], Train Loss: 0.5179, Valid Loss: 0.6892
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/model.pt
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/metrics.pt
Epoch [2/20], Step [4/40], Train Loss: 0.8326, Valid Loss: 0.6790
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/model.pt
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/metrics.pt
Epoch [3/20], Step [5/40], Train Loss: 0.5921, Valid Loss: 0.6735
Model saved to ==> drive/MyDrive/Colab-Notebooks/prueba-bert/model.pt
Model saved to ==> drive/MyDrive