# BERT

## Dataset

**IMDB Dataset (pt_br)** 

Dataset com avaliações de filmes em português. Usado para classificação binária de sentimentos: positivo ou negativo.

https://www.kaggle.com/datasets/luisfredgs/imdb-ptbr

In [1]:
caminho = 'imdb-ptbr.zip'

In [2]:
import pandas as pd

In [3]:
df_data = pd.read_csv(caminho)

df_data

Unnamed: 0,id,text_en,text_pt,sentiment
0,1,Once again Mr. Costner has dragged out a movie...,"Mais uma vez, o Sr. Costner arrumou um filme p...",neg
1,2,This is an example of why the majority of acti...,Este é um exemplo do motivo pelo qual a maiori...,neg
2,3,"First of all I hate those moronic rappers, who...","Primeiro de tudo eu odeio esses raps imbecis, ...",neg
3,4,Not even the Beatles could write songs everyon...,Nem mesmo os Beatles puderam escrever músicas ...,neg
4,5,Brass pictures movies is not a fitting word fo...,Filmes de fotos de latão não é uma palavra apr...,neg
...,...,...,...,...
49454,49456,"Seeing as the vote average was pretty low, and...","Como a média de votos era muito baixa, e o fat...",pos
49455,49457,"The plot had some wretched, unbelievable twist...",O enredo teve algumas reviravoltas infelizes e...,pos
49456,49458,I am amazed at how this movieand most others h...,Estou espantado com a forma como este filme e ...,pos
49457,49459,A Christmas Together actually came before my t...,A Christmas Together realmente veio antes do m...,pos


In [4]:
df_data['sentiment'].value_counts()

neg    24765
pos    24694
Name: sentiment, dtype: int64

In [5]:
possible_labels = df_data['sentiment'].unique()

label_dict = {}

for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

label_dict

{'neg': 0, 'pos': 1}

In [6]:
df_data['label'] = df_data['sentiment'].replace(label_dict)

In [7]:
df_data

Unnamed: 0,id,text_en,text_pt,sentiment,label
0,1,Once again Mr. Costner has dragged out a movie...,"Mais uma vez, o Sr. Costner arrumou um filme p...",neg,0
1,2,This is an example of why the majority of acti...,Este é um exemplo do motivo pelo qual a maiori...,neg,0
2,3,"First of all I hate those moronic rappers, who...","Primeiro de tudo eu odeio esses raps imbecis, ...",neg,0
3,4,Not even the Beatles could write songs everyon...,Nem mesmo os Beatles puderam escrever músicas ...,neg,0
4,5,Brass pictures movies is not a fitting word fo...,Filmes de fotos de latão não é uma palavra apr...,neg,0
...,...,...,...,...,...
49454,49456,"Seeing as the vote average was pretty low, and...","Como a média de votos era muito baixa, e o fat...",pos,1
49455,49457,"The plot had some wretched, unbelievable twist...",O enredo teve algumas reviravoltas infelizes e...,pos,1
49456,49458,I am amazed at how this movieand most others h...,Estou espantado com a forma como este filme e ...,pos,1
49457,49459,A Christmas Together actually came before my t...,A Christmas Together realmente veio antes do m...,pos,1


## Abordagem Feature-based

* Usa o modelo de linguagem para gerar os embeddings, o qual serve de entrada para o classificador.

* O classificador é treinado para gerar o modelo

In [8]:
# !pip install transformers

In [9]:
from transformers import BertTokenizer, BertModel

tokenizer_bertimbau = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

model_bertimbau = BertModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

  from .autonotebook import tqdm as notebook_tqdm
Downloading (…)solve/main/vocab.txt: 100%|██████████████████████████████████████████| 210k/210k [00:00<00:00, 14.0MB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading (…)in/added_tokens.json: 100%|██████████████████████████████████████████████████| 2.00/2.00 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████| 112/112 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████| 43.0/43.0 [00:00<00:00, 43.0kB/s]
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████| 647/647 [00:00<00:00, 633kB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████

In [10]:
model_bertimbau

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(29794, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [11]:
import torch

# conversão para tokens e identificação através do id
input_ids = torch.tensor(tokenizer_bertimbau.encode('Estudando modelos de linguagem contextualizados', add_special_tokens=True, max_length=512, truncation=True)).unsqueeze(0)

output = model_bertimbau(input_ids, output_hidden_states=True)

In [12]:
input_ids

tensor([[  101, 13025,   214,  4585,   125,  4616, 18880,  2066,  3833,   102]])

In [13]:
# não tokeniza por palavra mas sim pelo vocabulario (algoritmo wordpiece)

tokens = tokenizer_bertimbau.convert_ids_to_tokens(input_ids[0])
tokens

['[CLS]',
 'Estuda',
 '##ndo',
 'modelos',
 'de',
 'linguagem',
 'contex',
 '##tual',
 '##izados',
 '[SEP]']

In [14]:
len(output[0][0])
# 10 embeddings para 10 tokens

10

In [15]:
len(output[0][0][0])
# 768 posiçoes para cada embedding

768

In [16]:
# embedding de 1 token
output[0][0][0]

tensor([ 1.4259e-01,  4.0103e-01,  9.1296e-01,  2.8389e-01, -7.8744e-02,
         5.7209e-01,  6.0349e-01,  9.3607e-02,  1.6760e-01,  5.3643e-01,
        -1.6446e-01, -6.3722e-02, -4.0051e-01,  8.6854e-01,  8.3990e-02,
        -6.1482e-01,  2.9263e-01,  3.1990e-01, -1.9013e-01,  6.1508e-01,
         1.9000e-01, -3.3548e-01,  2.6052e-01, -4.5865e-01,  3.4804e-01,
         4.4601e-01, -3.1722e-01,  4.3954e-01,  1.2926e-01, -1.8389e-01,
        -1.8486e-01,  2.8515e-02, -4.8374e-01, -1.4387e-01,  1.2798e-01,
        -3.2165e-01,  5.3915e-01,  1.0934e-01,  2.6519e-01,  1.7941e-02,
        -2.1396e-01, -2.6194e-01, -4.8170e-01, -1.6534e-01, -5.4345e-01,
        -3.9526e-01, -7.3860e-02, -2.0253e-01, -4.4021e-01,  6.8335e-02,
         2.8324e-01, -1.1216e-01,  4.9492e-01,  4.8311e-01,  4.5922e-01,
        -5.0580e-02,  1.7564e-01, -3.7993e-01,  1.2108e-01,  5.1453e-01,
         3.5038e-01, -3.3191e-01,  9.0486e-03, -2.1669e-01, -2.5346e-02,
         2.2016e-01,  8.9177e-02, -1.1540e-01, -5.5

## Embeddings a partir do Sentence Transformers

Faz o embedding de toda a setença e não somente dos tokens

In [17]:
# !pip install -U sentence_transformers

In [18]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('neuralmind/bert-base-portuguese-cased')

model.max_seq_length = 512

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)

Downloading (…)f0baa/.gitattributes: 100%|█████████████████████████████████████████████| 391/391 [00:00<00:00, 387kB/s]
Downloading (…)e10def0baa/README.md: 100%|████████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 3.58MB/s]
Downloading (…)aa/added_tokens.json: 100%|██████████████████████████████████████████| 2.00/2.00 [00:00<00:00, 1.95kB/s]
Downloading (…)0def0baa/config.json: 100%|█████████████████████████████████████████████| 647/647 [00:00<00:00, 648kB/s]
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████| 438M/438M [01:26<00:00, 5.06MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████| 112/112 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████| 43.0/43.0 [00:00<00:00, 43.0kB/s]
Downloading (…)e10def0baa/vocab.txt: 100%|██████████████████████████████████████████| 210k/210k [00:00<00:00, 6.98MB/s]
No sentence-transformers model found wit

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [19]:
model.tokenize(['Estudando modelos de linguagem contextualizados'])

{'input_ids': tensor([[  101, 13025,   214,  4585,   125,  4616, 18880,  2066,  3833,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [20]:
embedding_sentence = model.encode('Estudando modelos de linguagem contextualizados')
embedding_sentence

array([ 7.08376477e-03,  5.16501293e-02,  8.27929854e-01,  1.08459711e-01,
        1.51964694e-01,  2.96854526e-01, -1.50732393e-03, -1.26951247e-01,
       -4.76407148e-02,  5.82769793e-03,  2.24021241e-01,  5.75402454e-02,
       -2.36572579e-01,  2.78626978e-01, -3.14570107e-02, -4.04580206e-01,
        1.26978070e-01, -1.18314192e-01,  3.60915735e-02,  3.31545830e-01,
       -1.12002730e-01, -9.04822722e-02, -1.58047900e-01, -4.21076477e-01,
        2.17379570e-01,  3.76325846e-01, -3.26602086e-02,  1.42404899e-01,
       -1.34578526e-01, -5.89618683e-01,  2.05488563e-01, -3.94075327e-02,
       -3.70831072e-01, -7.68706352e-02,  2.41237327e-01, -1.71949580e-01,
        6.19709969e-01,  2.19713062e-01,  4.98909146e-01,  1.97408989e-01,
       -1.40037388e-01, -1.64626390e-01, -9.99341309e-02, -2.78383255e-01,
       -1.79463580e-01, -3.32419932e-01, -1.76172376e-01,  8.85227043e-03,
       -2.88362056e-01,  6.84021339e-02,  7.66360834e-02, -1.14154860e-01,
        1.86520964e-01,  

In [21]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [22]:
import tqdm
import numpy as np

In [23]:
X_embeddings = []
y_embeddings = []

for idx, row in tqdm.tqdm(df_data.iterrows()):
    

    embedding_texto = np.array(model.encode(row['text_pt']))

    X_embeddings.append(embedding_texto)
    y_embeddings.append(row['label'])

X_embeddings = np.array(X_embeddings)
y_embeddings = np.array(y_embeddings)

6776it [1:18:47,  1.43it/s]


KeyboardInterrupt: 

In [None]:
import xgboost as xgb
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, classification_report

In [None]:
# XGBoost

num_class = df_data['label'].nunique()

xgb = xgb.XGBClassifier(max_depth=4, n_estimators=1000, objective='multi:softmax', learning_rate=0.1, num_class=num_class)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
sss.get_n_splits(X_embeddings, y_embeddings)

for train_index, test_index in sss.split(X_embeddings, y_embeddings):

    X_train, X_test = X_embeddings[train_index], X_embeddings[test_index]
    y_train, y_test = y_embeddings[train_index], y_embeddings[test_index] 

xgb.fit(X_train, y_train)

print(classification_report(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))

# Abordagem Fine-tuning

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

num_class = df_data['label'].nunique()

tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

model = BertForSequenceClassification.from_pretrained('neuralmind/bert-base-portuguese-cased',
                                                      num_labels=num_class,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_data.index.values,
                                                    df_data.label.values,
                                                    test_size=0.20,
                                                    random_state=0,
                                                    stratify=df_data.label.values)

df_data['data_type'] = ['not_set'] * df_data.shape[0]

df_data.loc[X_train, 'data_type'] = 'train'
df_data.loc[X_test, 'data_type'] = 'test'

df_data.groupby(['sentiment', 'label', 'data_type']).count()

In [None]:
# embedding eh gerado dentro da arquitetura
# sao gerados os ids dos tokens para servir de input no modelo

In [None]:
import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset
                                          
encoded_data_train = tokenizer.batch_encode_plus(
    df_data[df_data.data_type=='train'].text_pt.values, 
    add_special_tokens=True, #[CLS] Sentença [SEP]
    return_attention_mask=True, 
    pad_to_max_length=True, #[PAD]
    max_length=512, 
    return_tensors='pt'
)

encoded_data_test = tokenizer.batch_encode_plus(
    df_data[df_data.data_type=='test'].text_pt.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=512, 
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df_data[df_data.data_type=='train'].label.values)

input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(df_data[df_data.data_type=='test'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 8

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_test = DataLoader(dataset_test, 
                                   sampler=SequentialSampler(dataset_test), 
                                   batch_size=batch_size) 

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)
                  
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

In [None]:
from sklearn.metrics import f1_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

def f1_score_func(preds, labels, metric):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average=metric)

def f1_score_func_average(preds, labels, average_f1):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average=average_f1)

def accuracy_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return metrics.accuracy_score(labels_flat, preds_flat)

def classification_report_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    report = metrics.classification_report(labels_flat,preds_flat)
    print(report)

def matrix_confusion_class(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    print(confusion_matrix(labels_flat,preds_flat))

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

In [None]:
def evaluate(dataloader_test):

    model.eval()
    
    loss_test_total = 0
    predictions, true_test = [], []
    
    for batch in dataloader_test:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_test_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_test.append(label_ids)
    
    loss_test_avg = loss_test_total/len(dataloader_test) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_test = np.concatenate(true_test, axis=0)
            
    return loss_test_avg, predictions, true_test

In [None]:
import random
import numpy as np
import tqdm

seed_test = 42
random.seed(seed_test)
np.random.seed(seed_test)
torch.manual_seed(seed_test)
torch.cuda.manual_seed_all(seed_test)
device = 'cuda'

model.to(device)
    
for epoch in tqdm.tqdm(range(1, epochs+1)):   
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm.tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})
         
        
    torch.save(model.state_dict(), f'/content/drive/MyDrive/UNIFOR/Tutorial BERT/fine_tuning/finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.tqdm.write(f'Training loss: {loss_train_avg}')
    
    test_loss, predictions, true_test = evaluate(dataloader_test)
    tqdm.tqdm.write(f'Test loss: {test_loss}')
    f1_micro = f1_score_func(predictions, true_test, 'micro')
    f1_macro = f1_score_func(predictions, true_test, 'macro')
    f1_weighted = f1_score_func(predictions, true_test, 'weighted')
    tqdm.tqdm.write(f'F1 Score (Micro): {f1_micro}')
    tqdm.tqdm.write(f'F1 Score (Macro): {f1_macro}')
    tqdm.tqdm.write(f'F1 Score (Weighted): {f1_weighted}')