# Transfer learning

This notebook demonstrates how to fine-tune a BERT model on the IMDB dataset. The aim is to make predictions with the newly fine-tuned model.

We will use the **DistilBERT base model (uncased)** from Hugging Face: https://huggingface.co/distilbert/distilbert-base-uncased

In [1]:
%pip install datasets



Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 471.6/471.6 kB 34.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 12.8 MB/s eta 0:00

In [2]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support



In [3]:
# torch lets us use the GPU if one is available; this check is optional and can be removed when running on CPU
import torch
torch.cuda.is_available()

True

In [4]:
# Check if GPU is available and use it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [5]:
# Load pre-trained DistilBERT model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
model.to(device)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



In [6]:
# Download the dataset from Hugging Face and drop the "unsupervised" split, which we will not use.

from datasets import load_dataset

dataset = load_dataset('imdb')
del dataset['unsupervised']

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})

In [8]:
print(dataset['train'][1]) # Example of data point


{'text': '"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn\'t matter what one\'s political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don\'t exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we\'re treated to the site of Vincent Gallo\'s throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, 

In [9]:

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
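To make the effect of `padding='max_length', truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the function `pad_or_truncate` and the token ids in the example are hypothetical, and the special-token ids (0 for `[PAD]`, 101 for `[CLS]`, 102 for `[SEP]`) are assumed to match the BERT uncased vocabulary. A small `max_length` is used for readability; in the cell above the tokenizer pads to its own maximum length instead.

```python
# Assumed special-token ids of the BERT uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_or_truncate(token_ids, max_length):
    """Toy stand-in for padding='max_length', truncation=True on one sequence."""
    body = token_ids[: max_length - 2]      # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


# Hypothetical token ids, chosen only for illustration.
short_review = pad_or_truncate([2023, 3185], max_length=8)
long_review = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_review)
print(long_review)
```

The short sequence is filled up with padding that the attention mask tells the model to ignore, while the long sequence is cut so that every example ends up with the same length.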

In [10]:
# Prepare the data for PyTorch by setting format
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

In [11]:
# Define custom metric function
def compute_metrics(p):
    preds = p.predictions.argmax(-1)  # Get the predicted class
    labels = p.label_ids  # Get the true labels
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
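As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the same four numbers by hand on a tiny made-up batch. The logits and labels are hypothetical, and the code uses plain Python instead of scikit-learn so the counting is visible. With `average='binary'`, precision, recall and F1 refer to the positive class (label 1).

```python
# Hypothetical logits for 8 reviews: one row per review, columns = [negative, positive].
logits = [
    [-1.2, 2.0], [1.5, -0.3], [0.4, 0.1], [-0.5, 0.9],
    [-0.2, 0.3], [2.1, -1.0], [0.8, 0.2], [1.1, -0.7],
]
labels = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical true labels

# Equivalent of predictions.argmax(-1): index of the largest logit in each row.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)  # true positives
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)  # false positives
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)  # false negatives

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```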

In [12]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    eval_strategy="epoch",           # evaluation strategy
    save_steps=1250,                 # save model every 1250 steps
    save_total_limit=2,              # only keep the last two checkpoints to save space
)

# Initialize Trainer with model, training arguments, datasets, and metrics
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_datasets['train'],  # training dataset
    eval_dataset=tokenized_datasets['test'],    # evaluation dataset
    compute_metrics=compute_metrics,     # add the custom metric function
)


In [13]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2303 | 0.217279 | 0.91636 | 0.947084 | 0.882 | 0.913384 |
| 2 | 0.1325 | 0.24399 | 0.9308 | 0.923749 | 0.93912 | 0.931371 |
| 3 | 0.0578 | 0.305836 | 0.93304 | 0.928583 | 0.93824 | 0.933386 |


TrainOutput(global_step=4689, training_loss=0.15523866408159545, metrics={'train_runtime': 4951.3967, 'train_samples_per_second': 15.147, 'train_steps_per_second': 0.947, 'total_flos': 9935054899200000.0, 'train_loss': 0.15523866408159545, 'epoch': 3.0})
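The `global_step=4689` reported above can be checked with a little arithmetic. This is a minimal sketch that assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept. It also lists the step counts that are multiples of `save_steps`.

```python
import math

train_examples = 25_000  # rows in the train split
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs
save_steps = 1250        # save_steps

# 25000 / 16 = 1562.5, so the last batch of each epoch is only half full.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Step counts within the run that are multiples of save_steps.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions the total matches the reported 4689 steps, and with `save_total_limit=2` only the two most recent checkpoints would be kept on disk.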

In [14]:
eval_result = trainer.evaluate()
print(f"Evaluation results: {eval_result}")


Evaluation results: {'eval_loss': 0.3058363199234009, 'eval_accuracy': 0.93304, 'eval_precision': 0.9285827395091053, 'eval_recall': 0.93824, 'eval_f1': 0.9333863907680063, 'eval_runtime': 441.9344, 'eval_samples_per_second': 56.569, 'eval_steps_per_second': 3.537, 'epoch': 3.0}


**Evaluation results:**

- `eval_loss`: 0.30790528655052185
- `eval_accuracy`: 0.9326
- `eval_precision`: 0.9334669338677355
- `eval_recall`: 0.9316
- `eval_f1`: 0.9325325325325325
- `eval_runtime`: 412.7953
- `eval_samples_per_second`: 60.563
- `eval_steps_per_second`: 3.786
- `epoch`: 3.0
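A quick sanity check on these numbers: F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below does this for both sets of results shown above (the `trainer.evaluate()` output and the summary that follows it). The dictionary keys are just labels chosen for this example.

```python
# (precision, recall, reported F1) copied from the two result listings above.
reported = {
    "trainer.evaluate() output": (0.9285827395091053, 0.93824, 0.9333863907680063),
    "results summary": (0.9334669338677355, 0.9316, 0.9325325325325325),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = 2 * precision * recall / (precision + recall)  # harmonic mean
    print(f"{name}: reported F1={f1:.6f}, recomputed F1={recomputed:.6f}")
```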


In [None]:
# Save the model and tokenizer after training
model.save_pretrained('./results')
tokenizer.save_pretrained('./results')

## Using our new model to make predictions on a new sentence

In [17]:
# Load the fine-tuned model and tokenizer
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Load the model and tokenizer from the saved directory
model = DistilBertForSequenceClassification.from_pretrained('./results')
tokenizer = DistilBertTokenizer.from_pretrained('./results')

# Test with a manually input sentence
text = "it was so funny. amazing"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Put model in evaluation mode
model.eval()

# Make prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_id = logits.argmax(dim=-1).item()  # argmax over the class dimension

# Map predicted class ID to sentiment
if predicted_class_id == 1:
    print(f"Sentence: {text}\nPredicted Sentiment: Positive")
else:
    print(f"Sentence: {text}\nPredicted Sentiment: Negative")

Sentence: it was so funny. amazing
Predicted Sentiment: Positive
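
The prediction above only reports the winning class. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below shows the idea in plain Python on made-up logits (the values and the `id2label` mapping are assumptions for illustration, following the 0 = negative, 1 = positive convention used in this notebook). With the real model, `torch.softmax(logits, dim=-1)` would play the same role.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


hypothetical_logits = [-2.1, 2.4]             # made-up [negative, positive] scores
id2label = {0: "Negative", 1: "Positive"}     # assumed label mapping

probs = softmax(hypothetical_logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)

print(f"Predicted Sentiment: {id2label[predicted_class_id]} "
      f"(probability {probs[predicted_class_id]:.3f})")
```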
