**THIS NOTEBOOK ORIGINATED IN GOOGLE COLAB**

This notebook fine-tunes a BERT model to predict ratings from a dataset of reviews.

It reaches 88% accuracy on the test dataset and predicts ratings from 1 to 5.

<hr>

**INSTALLING THE LIBRARIES WE WILL USE**

In [5]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.1-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

<hr>

**LOADING THE DATA FROM GOOGLE DRIVE**

For this to work, Google Drive must already be connected (mounted) in the Colab session.
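
In Colab, mounting is usually done with `google.colab.drive.mount`. The sketch below assumes the notebook runs in a Colab runtime and that `/content/drive` is the mount point (it matches the path used in the next cell). The helper name `mount_drive_if_colab` is hypothetical; outside Colab it simply does nothing.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True


mounted = mount_drive_if_colab()
print(f"Drive mounted: {mounted}")
```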

In [11]:
from datasets import load_from_disk

In [12]:
data = load_from_disk(r'/content/drive/MyDrive/datasets/test_data')

<hr>

**MODEL**

AutoTokenizer is a class that loads the tokenizer matching the model checkpoint and converts our reviews into inputs for the model.


DataCollatorWithPadding is a class that applies dynamic padding: each batch is padded only to the length of its longest sequence, which keeps the input tensors small.
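
As a rough illustration of dynamic padding, here is a pure-Python sketch. It is not the library's implementation: the helper `pad_batch` and the token ids are made up for the example, and `pad_id=0` is assumed to be the padding token id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three reviews of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A batch whose longest review has 5 tokens is padded to length 5, regardless of how long reviews in other batches are.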

TFAutoModelForSequenceClassification adds a classification head on top of the transformer. Under the hood, it creates the extra layers (a dropout layer and a dense classifier).
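
To make the idea of a "head" concrete, the NumPy sketch below applies a single dense layer to a stand-in for BERT's pooled output. It is a simplification under stated assumptions: the inputs and weights are random, dropout is omitted, and the hidden size of 768 is the one used by bert-base models.

```python
import numpy as np

hidden_size = 768  # hidden size of bert-base models
num_labels = 5     # one class per rating

rng = np.random.default_rng(0)
# Stand-in for BERT's pooled output for a batch of 2 reviews (random values).
pooled_output = rng.normal(size=(2, hidden_size))

# The dense classifier: a weight matrix and a bias vector.
weight = rng.normal(scale=0.02, size=(hidden_size, num_labels))
bias = np.zeros(num_labels)

logits = pooled_output @ weight + bias  # shape: (batch, num_labels)
num_head_params = weight.size + bias.size

print(logits.shape)
print(num_head_params)
```

The parameter count, 768 * 5 + 5 = 3845, agrees with the `classifier (Dense)` row in the model summary further down.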

In [36]:
import tensorflow as tf

from transformers import AutoTokenizer

from transformers import DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification

# Loss function for the classification problem.
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [15]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**CREATING THE DATASETS FOR THE MODEL**

In [16]:
tf_train_dataset = data["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["rating"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = data["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["rating"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


**EXTRA**

PolynomialDecay lowers the learning rate as training progresses. This affects the model's performance by around 27%.
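
The schedule can be sketched in plain Python. This follows the formula given in the Keras documentation for `PolynomialDecay` without cycling; with the default `power=1.0` the decay is linear. The value `decay_steps=3000` is a hypothetical placeholder (the notebook derives the real value from the dataset size and the number of epochs).

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=3000, power=1.0):
    """Learning rate after `step` optimizer steps (no cycling)."""
    step = min(step, decay_steps)  # stay at end_lr once decay_steps is reached
    fraction_left = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


for step in (0, 1500, 3000):
    print(step, polynomial_decay(step))
```

With these settings the learning rate starts at 5e-5, is halved at the midpoint, and reaches 0.0 on the last step.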

In [17]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam

In [18]:
batch_size = 8
num_epochs = 3
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)

In [19]:
# We use accuracy as the evaluation metric.
model.compile(
    optimizer=opt,
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
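
The loss is built with `from_logits=True` because the model returns raw logits rather than probabilities, and the sparse variant expects integer class indices as labels. As a rough NumPy sketch of what this loss computes (the function name and the example values are made up for illustration):

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean negative log-probability of the true class, computed from raw logits."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Two hypothetical examples with 5 classes; labels are class indices 0-4.
logits = [[2.0, 0.5, 0.1, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0, 0.0]]
labels = np.array([0, 3])
loss = sparse_categorical_crossentropy_from_logits(labels, logits)
print(loss)
```

For an example whose logits are all equal, the loss is log(5), the value expected from a model that guesses uniformly among five classes.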

In [20]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  3845      
                                                                 
Total params: 109,486,085
Trainable params: 109,486,085
Non-trainable params: 0
_________________________________________________________________


In [21]:
# Train for num_epochs (3) epochs, matching the learning-rate schedule.
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
    verbose=True,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f7b5b19d210>

In [25]:
model.config.label2id = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}

{'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}

In [26]:
# Save the model.
model.save_pretrained('/content/drive/MyDrive/datasets/bert_classification_alaska_88')

**EVALUATION**

In [35]:
import sklearn.metrics as skm
import numpy as np

In [58]:
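# Predict logits for the validation set and pick the most likely class.
# (argmax over the softmax gives the same result as argmax over the raw logits.)
y_pred = model.predict(tf_validation_dataset)
y_pred = np.argmax(tf.math.softmax(y_pred['logits'], axis=-1), axis=1)
# Ground-truth labels, in the same order as the predictions because shuffle=False.
y_true = np.array(data['validation']['rating'])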

score = skm.accuracy_score(y_true=y_true, y_pred=y_pred)
print(f'Accuracy for the validation dataset: {score:.2f}')

score = skm.f1_score(y_true=y_true, y_pred=y_pred, average='macro')
print(f'F1-score for the validation dataset: {score:.2f}')


Accuracy for the validation dataset: 0.88
F1-score for the validation dataset: 0.81
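
Macro F1 averages the per-class F1 scores without weighting, so it can sit below accuracy when some classes are rare and poorly predicted. Whether that explains the gap here is an assumption, since the class distribution is not shown. The NumPy sketch below uses toy labels (not the notebook's data) to show the effect:

```python
import numpy as np


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Toy, imbalanced labels: class 1 is rare and half of it is missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, num_classes=2))
```

In this toy case one mistake on the rare class costs 10 points of accuracy but pulls that class's F1 down to 2/3, which drags the macro average below the accuracy.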


In [56]:
tf_test_dataset = data["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["rating"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)


In [59]:
y_pred = model.predict(tf_test_dataset)
y_pred = np.argmax(tf.math.softmax(y_pred['logits'], axis=-1), axis=1)
y_true = np.array(data['test']['rating'])

score = skm.accuracy_score(y_true=y_true, y_pred=y_pred)
print(f'Accuracy for the test dataset: {score:.2f}')

score = skm.f1_score(y_true=y_true, y_pred=y_pred, average='macro')
print(f'F1-score for the test dataset: {score:.2f}')

Accuracy for the test dataset: 0.88
F1-score for the test dataset: 0.80


**EXAMPLES**

In [75]:
example_review = ['The service was really good.', 'Horrible horrible horrible, I will never come back again.', 'The food was okey, but the waiter was very disrespectful.']
example_inputs = tokenizer(example_review, padding=True, truncation=True, return_tensors='tf')
example_pred = np.argmax(model(**example_inputs)['logits'], axis=1)

for i in range(len(example_review)):
    print(f'Review: {example_review[i]} | Rating: {example_pred[i]}')

Review: The service was really good. | Rating: 3
Review: Horrible horrible horrible, I will never come back again. | Rating: 0
Review: The food was okey, but the waiter was very disrespectful. | Rating: 1
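
Note that the values printed above are class indices (0 to 4), not ratings. Assuming the `label2id` mapping set on `model.config` earlier is the one used to encode the `rating` column, the indices can be converted back to ratings as in this sketch (the variable names are illustrative):

```python
# Mapping taken from the notebook: rating label -> class index.
label2id = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4}
# Inverse mapping: class index -> rating label.
id2label = {idx: label for label, idx in label2id.items()}

# Class indices printed by the example cell above.
predicted_indices = [3, 0, 1]
predicted_ratings = [int(id2label[idx]) for idx in predicted_indices]
print(predicted_ratings)  # [4, 1, 2]
```

Under that assumption, the three example reviews are rated 4, 1 and 2 on the 1 to 5 scale.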
