<a href="https://colab.research.google.com/github/zrghassabi/LLM/blob/main/Chapter5_Solution1_DistilBERT_SST2_Transfer_Learning%5B1%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Challenge: Fine-Tuning a Sentiment Analysis Model
In this challenge, you'll fine-tune a sentiment analysis model using DistilBERT and a sentiment analysis dataset. The exercise improves a pre-trained model's ability to classify the sentiment of text, a core skill in NLP applications.

Steps:
1. Load data: Download and preprocess a sentiment analysis dataset, such as the SST-2 dataset, to prepare it for training.
2. Initialize model: Load the pre-trained DistilBERT model and tokenizer from Hugging Face's Transformers library.
3. Prepare data for training: Tokenize the dataset and create training and validation splits.
4. Fine-tune the model: Train the DistilBERT model on the tokenized dataset, adjusting its parameters to learn sentiment classification.
5. Evaluate performance: Assess the model's performance using metrics such as accuracy and F1 score to check that it predicts sentiment accurately.

Conclusion:
By completing this challenge, you gain hands-on experience in fine-tuning a sentiment analysis model. This is one component of a comprehensive NLP solution, where sentiment analysis, translation, and Q&A capabilities work together in integrated AI applications.

# Sentiment Analysis with DistilBERT and SST2 Dataset
This notebook demonstrates how to perform sentiment analysis by fine-tuning a pre-trained DistilBERT model on the SST2 dataset from Hugging Face.

In [None]:

# Install necessary packages
!pip install transformers datasets tensorflow


Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

## Import Libraries
We begin by importing the necessary libraries.

In [None]:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset


## Load the SST2 Dataset
Next, we load the SST2 dataset from Hugging Face.

In [None]:

# Load the SST2 dataset
dataset = load_dataset("stanfordnlp/sst2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

## Load the DistilBERT Tokenizer
We load the pre-trained DistilBERT tokenizer to process the input text.

In [None]:

# Load the DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

## Tokenize the Dataset
We define a function to tokenize the dataset and apply it to the SST2 dataset.

In [None]:

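# Tokenize the dataset: pad or truncate every sentence to exactly 128 tokens.
# Plain Python lists are returned here; to_tf_dataset builds the tensors later.
def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True, max_length=128)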

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]
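
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of how a token sequence becomes fixed-length `input_ids` plus an `attention_mask`. The helper `pad_and_truncate` and the token IDs are hypothetical and for illustration only; the real tokenizer also handles special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    ids = list(token_ids[:max_length])   # truncate sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # number of padding positions to add
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Illustrative token IDs only (assumed values, not taken from the notebook).
input_ids, attention_mask = pad_and_truncate([101, 2023, 3185, 102], max_length=8)
print(input_ids)       # [101, 2023, 3185, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask lets the model ignore the padding positions, which is why both columns are passed on to training.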

## Convert to TensorFlow Dataset
We convert the tokenized dataset to a format that can be used with TensorFlow.

In [None]:

# Convert the tokenized dataset to a TensorFlow dataset
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols="label",
    shuffle=True,
    batch_size=64
)

validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask"],
    label_cols="label",
    shuffle=False,
    batch_size=64
)


In [None]:
# Inspect one batch: the features are a dict, the labels are a separate tensor
for features, labels in train_dataset.take(1):
    print(features['input_ids'].shape)
    print(features['attention_mask'].shape)
    print(labels.shape)

(64, 128)
(64, 128)
(64,)
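
With a batch size of 64, the number of batches per epoch follows from the split sizes reported when the dataset was loaded (67349 training and 872 validation examples). A small sketch of that arithmetic, assuming the final partial batch is kept rather than dropped; `batches_per_epoch` is a hypothetical helper name:

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


print(batches_per_epoch(67349, 64))  # 1053
print(batches_per_epoch(872, 64))    # 14
```

Under that assumption, the shapes printed above describe a full batch; the last batch of each epoch would be smaller.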


## Load and Configure the DistilBERT Model
We load the pre-trained DistilBERT model and configure it for sequence classification.

In [None]:

# Load the pre-trained DistilBERT model
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

## Freeze DistilBERT Layers
We freeze the DistilBERT base layers so that only the classification head (the `pre_classifier` and `classifier` dense layers) is updated during training.

In [None]:

# Freeze the DistilBERT layers
model.layers[0].trainable = False


In [None]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 592130 (2.26 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________
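
The trainable parameter count in the summary can be checked by hand. A dense layer has one weight per input-output pair plus one bias per output, and the hidden size of `distilbert-base-uncased` is 768. A short sketch of the arithmetic (`dense_params` is a hypothetical helper name):

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


hidden = 768  # hidden size of distilbert-base-uncased
pre_classifier = dense_params(hidden, hidden)
classifier = dense_params(hidden, 2)  # two sentiment classes

print(pre_classifier)                         # 590592
print(classifier)                             # 1538
print(pre_classifier + classifier)            # 592130 trainable
print(66362880 + pre_classifier + classifier) # 66955010 total
```

These values match the `pre_classifier`, `classifier`, trainable, and total rows of the summary, confirming that only the two dense layers remain trainable after freezing.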


## Compile the Model
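We compile the model with the Adam optimizer, a sparse categorical cross-entropy loss, and accuracy as the metric. The loss uses `from_logits=True` because the model outputs raw scores (logits) rather than probabilities.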

In [None]:

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.metrics.SparseCategoricalAccuracy()],
)
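
To illustrate what a sparse categorical cross-entropy with `from_logits=True` computes for one example, here is a minimal pure-Python sketch. It is not Keras's implementation; the function name `sparse_ce_from_logits` and the sample logits are hypothetical.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal logits mean the model is undecided between the two classes.
print(sparse_ce_from_logits([0.0, 0.0], label=0))  # about 0.693, i.e. ln 2

# A confident, correct prediction gives a much smaller loss than a wrong one.
print(sparse_ce_from_logits([3.0, -3.0], label=0))
print(sparse_ce_from_logits([3.0, -3.0], label=1))
```

"Sparse" means the label is an integer class index (0 or 1 here) rather than a one-hot vector. By default, Keras then averages the per-example losses over the batch.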


## Train the Model
We train the model using the training dataset and validate it using the validation dataset.

In [None]:

# Train the model
model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=3
)


Epoch 1/3


Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x798c442a5cc0>

## Save the Model
We save the trained model for future use.

In [None]:

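# Save the model and the tokenizer so both can be reloaded from the same directory
model.save_pretrained("./distilbert-sst2")
tokenizer.save_pretrained("./distilbert-sst2")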


## Evaluate the Model
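Finally, we evaluate the model on the validation dataset to check its performance. This is the same split used for validation during training, so the figures below are validation metrics, not held-out test metrics.

In [None]:

results = model.evaluate(validation_dataset)
print(f"Validation loss: {results[0]}")
print(f"Validation accuracy: {results[1]}")


Validation loss: 0.36161550879478455
Validation accuracy: 0.8337156176567078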
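
The evaluation above reports only accuracy, while step 5 of the challenge also mentions the F1 score. The following is a minimal pure-Python sketch of a binary F1 computation. The helper `binary_f1` and the sample labels are hypothetical; in the notebook, the predicted labels would have to be derived from the model's logits on the validation set.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 score for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration (1 = positive sentiment, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(binary_f1(y_true, y_pred))  # 0.75
```

F1 is more informative than accuracy when the two classes are imbalanced, because it penalizes both missed positives and false alarms.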
