<div style="text-align: center;">
        <img src="./static/sentiment_header.png" width="570px" style="height: auto;"></img>
</div>

---

Let's investigate how we can use a pretrained BabyBERT model to detect the sentiment expressed by a sentence.

#### 📦 Importing dependencies

First, we'll import all the dependencies needed for this notebook.

In [3]:
from babybert.data import LanguageModelingDataset, load_dataset
from babybert.model import BabyBERT, BabyBERTForSentimentClassification
from babybert.tokenizer import WordPieceTokenizer
from babybert.trainer import Trainer, TrainerConfig

#### ⬆️ Loading our pretrained tokenizer and model

Let's load the pretrained tokenizer and BabyBERT model checkpoints from the previous two notebooks.

In [None]:
checkpoint_directory = "./checkpoints/toy-model"
tokenizer = WordPieceTokenizer.from_pretrained(checkpoint_directory)
model = BabyBERT.from_pretrained(checkpoint_directory)

#### 📚 Building our training dataset 

Now that we have our tokenizer and model prepared, we can start assembling our dataset. The sentiment classification dataset we will be using contains two rows for each sample. The first row contains the sample text, and the second row contains a binary label indicating whether the text has a positive or negative sentiment (0 for negative, 1 for positive).

In [4]:
dataset = load_dataset("./data/sentiment_classification.txt")

Let's check out a few samples from the dataset:

In [25]:
for sentence, label in list(zip(*dataset.values()))[:3]:
    print(f"Sentence: {sentence}")
    print(f"Label: {label} ({'positive' if label else 'negative'})", end="\n\n")

Sentence: I love this movie so much!
Label: 1 (positive)

Sentence: This was the worst meal I've ever had.
Label: 0 (negative)

Sentence: What a fantastic day, everything went perfectly.
Label: 1 (positive)
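
Before fine-tuning, it can also be worth checking how balanced the two classes are. The sketch below assumes the dataset is a dictionary with parallel `text` and `label` lists, as the cell above suggests, and that labels are integers; `toy_dataset` is a hypothetical stand-in for the real file.

```python
from collections import Counter

# Hypothetical stand-in for the loaded dataset (same three samples as above).
toy_dataset = {
    "text": [
        "I love this movie so much!",
        "This was the worst meal I've ever had.",
        "What a fantastic day, everything went perfectly.",
    ],
    "label": [1, 0, 1],
}

# Count how many samples carry each label.
counts = Counter(toy_dataset["label"])
print(f"positive: {counts[1]}, negative: {counts[0]}")  # positive: 2, negative: 1
```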



Now that we have our fine-tuning dataset loaded, let's encode it using our tokenizer to obtain token IDs and attention masks.

Note that we pass in `model.config.block_size` for the `padding_length` argument. `block_size` specifies the length of the token sequences that our model expects, so we want to pad each sequence in our dataset to that length.

In [5]:
encoded = tokenizer.batch_encode(
    dataset["text"], padding_length=model.config.block_size
)
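
To make the padding step concrete, here is a minimal sketch of what padding to a fixed length involves. It is not the tokenizer's actual implementation: the helper `pad_to_length`, the pad ID of 0, and the truncation of over-long sequences are all assumptions made for illustration.

```python
def pad_to_length(token_ids, length, pad_id=0):
    """Pad (or truncate) token_ids to exactly `length` and build an attention mask.

    Hypothetical helper for illustration; pad_id=0 is an assumed convention.
    """
    kept = token_ids[:length]              # truncate sequences that are too long
    n_pad = length - len(kept)             # number of padding positions to add
    padded = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return padded, mask


ids, mask = pad_to_length([101, 7, 42, 102], length=8)
print(ids)   # [101, 7, 42, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask lets the model ignore the padded positions, so padding changes the sequence length without changing what the model attends to.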

Finally, we'll use `LanguageModelingDataset.from_dict` to create a dataset object from the token IDs, attention masks, and sentiment labels.

In [6]:
training_dataset = LanguageModelingDataset.from_dict(
    {**encoded, "labels": dataset["label"]}
)
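
Conceptually, a dataset built from a dictionary of equal-length columns hands back one sample at a time. The class below is an illustrative stand-in, not the actual `LanguageModelingDataset`; the name `DictDataset` and the keys `input_ids` and `attention_mask` are assumptions.

```python
class DictDataset:
    """Toy column-oriented dataset: each key maps to a list with one entry per sample."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) != 1:
            raise ValueError("all columns must have the same number of samples")
        self.columns = columns
        self.size = lengths.pop()

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        # Gather the index-th entry of every column into one sample dict.
        return {key: values[index] for key, values in self.columns.items()}


# Assumed key names for the encoded output; the real keys may differ.
toy_encoded = {
    "input_ids": [[5, 6, 0], [7, 0, 0]],
    "attention_mask": [[1, 1, 0], [1, 0, 0]],
}
toy = DictDataset({**toy_encoded, "labels": [1, 0]})
print(len(toy))  # 2
print(toy[1])    # {'input_ids': [7, 0, 0], 'attention_mask': [1, 0, 0], 'labels': 0}
```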

#### 💭 Setting up sentiment analysis head 

Right now, our `BabyBERT` model outputs a contextual embedding for each token in the input sequence. For sentiment classification, however, we want class predictions as output: one for the negative class and one for the positive class.

To do this, we can use the `BabyBERTForSentimentClassification` class like so:

In [None]:
sentiment_analysis_model = BabyBERTForSentimentClassification(model)
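
How `BabyBERTForSentimentClassification` does this internally is not shown in this notebook. A common design for BERT-style classifiers is to take the embedding of the first token and pass it through a linear layer that outputs one score per class. The pure-Python sketch below illustrates that idea with made-up numbers; `classify`, the weights, and the pooling choice are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_embeddings, weights, bias):
    """Map per-token embeddings to [p_negative, p_positive].

    Assumed design: pool by taking the first token's embedding, then apply
    a linear layer with one output row per class.
    """
    pooled = token_embeddings[0]
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Two tokens with 3-dimensional embeddings (made-up values).
embeddings = [[0.5, -1.0, 2.0], [0.1, 0.3, -0.2]]
weights = [[0.2, 0.1, -0.4], [-0.3, 0.0, 0.6]]  # row 0: negative, row 1: positive
bias = [0.0, 0.0]

probs = classify(embeddings, weights, bias)
print(probs)
```

Fine-tuning then adjusts the head's weights (and, typically, the underlying model's) so that the higher probability lands on the correct class.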

#### 💪 Instantiating the trainer

Next, let's configure and instantiate our trainer. For the sake of example, we'll train on a small number of samples (1,000). In a production setting, we would train on far more samples, over multiple epochs.

In [8]:
trainer_cfg = TrainerConfig(batch_size=16, num_workers=4, num_samples=1000)

trainer = Trainer(sentiment_analysis_model, trainer_cfg)
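
One detail worth noting: 1,000 is not a multiple of the batch size of 16. If the trainer consumes whole batches until it has seen at least `num_samples` samples (an assumption about `Trainer`, not something this notebook confirms), the arithmetic works out as follows, which would match the 1008 total shown by the progress bar in the training output.

```python
import math

batch_size = 16
num_samples = 1000

# Assumption: the trainer consumes whole batches until num_samples is reached.
num_batches = math.ceil(num_samples / batch_size)
samples_seen = num_batches * batch_size

print(num_batches)   # 63
print(samples_seen)  # 1008
```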

#### 🏋️ Fine-tuning BabyBERT

At this point, we have all the pieces in place. Let's fine-tune our model!

In [9]:
trainer.run(training_dataset)

Training: 100%|██████████| 1008/1008 [03:40<00:00,  4.57samples/s, loss=0.6802]
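
For context on the final loss value: if the trainer uses cross-entropy (an assumption; the notebook does not state the loss function), a two-class model that assigns probability 0.5 to each class scores ln 2 ≈ 0.693 per sample. The sketch below computes that chance-level baseline; `cross_entropy` is a hypothetical helper written for this illustration.

```python
import math


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[label])


# A model that cannot tell the classes apart predicts 50/50.
chance_loss = cross_entropy([0.5, 0.5], label=1)
print(round(chance_loss, 4))  # 0.6931
```

Under that assumption, a loss of 0.6802 is only slightly below the chance-level baseline, which fits the earlier point that a run this short is for demonstration rather than production use.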
