# FINE-TUNING LLM MODELS for SENTIMENT ANALYSIS

- Based on: https://huggingface.co/blog/sentiment-analysis-python
- Using HuggingFace
- LLM: "distilbert-base-uncased"
- Fine-Tune Dataset: "imdb"
- Train Dataset Size: 25000 Reviews
- Test Dataset Size: 25000 Reviews

## Results:
- Based on 25K/25K and 2 Epochs
- Base model (classification head not yet trained) produced 49% Accuracy
- Fine-Tuned (only 2 Epochs on all data) produced 93% Accuracy

## ToDos
- Run the test on 10+ Epochs
- Train with LoRA, QLoRA, ...
- Train using big model in Model Distributed mode
- Train with Data Distributed mode
- Train Big Model with LoRA in Model and Data Distributed modes
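
As a rough guide for the LoRA item above, the arithmetic below estimates how many parameters would be trained. It is a sketch under assumptions that are not in this notebook: a rank of 8 and adapters on `q_lin` and `v_lin` only. The 768 and 6 come from the model printout further down. LoRA replaces the update to a `d_out x d_in` weight with two low-rank factors, so each adapted matrix adds `rank * (d_in + d_out)` trainable parameters.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden = 768           # Linear(in_features=768, out_features=768) in the model printout
n_layers = 6           # DistilBERT has 6 transformer blocks
rank = 8               # ASSUMPTION: LoRA rank chosen only for illustration
adapted_per_layer = 2  # ASSUMPTION: adapt q_lin and v_lin only

full_per_matrix = hidden * hidden + hidden  # weight + bias of one attention projection
lora_per_matrix = lora_trainable_params(hidden, hidden, rank)
lora_total = lora_per_matrix * adapted_per_layer * n_layers

print("full fine-tuning, one projection:", full_per_matrix)
print("LoRA, one projection:", lora_per_matrix)
print("LoRA, all adapted projections:", lora_total)
```

The classification head would normally still be trained in full, so the real trainable count is somewhat higher than `lora_total`.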


### Notes:
- Fine-tuning overwrites, in place, the weights of the model passed to Trainer. To compare against the base model afterwards, keep a separate "model_base" copy, or run any required tests on the Hugging Face model before fine-tuning.
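
The note above can be illustrated with a toy sketch. `ToyModel` and `toy_train` are hypothetical stand-ins, not part of `transformers`; they only show that a second variable pointing at the same object is not a backup, while `copy.deepcopy` is.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for a model: just a dict of named weights."""

    def __init__(self):
        self.weights = {"classifier.weight": [0.1, -0.2], "classifier.bias": [0.0]}


def toy_train(model):
    """Mutates the weights in place, as an optimizer step would."""
    for name, values in model.weights.items():
        model.weights[name] = [v + 1.0 for v in values]


model = ToyModel()
alias = model                      # same object: NOT a backup
model_base = copy.deepcopy(model)  # independent snapshot taken before training

toy_train(model)

print(alias.weights["classifier.bias"])       # changed along with `model`
print(model_base.weights["classifier.bias"])  # still the pre-training value
```

With a real model, the same idea should apply by calling `copy.deepcopy(model)` before `trainer.train()`. This notebook instead calls `from_pretrained` a second time, which also works, although the two randomly initialized classification heads will then differ from each other.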


In [None]:
# Install required libraries
!pip install datasets transformers huggingface_hub
!apt-get install git-lfs
!pip install accelerate -U
!pip install evaluate

In [47]:
# Activate GPU for faster training by clicking on 'Runtime' > 'Change runtime type' and then selecting GPU as the Hardware accelerator
# Then check if GPU is available
import torch
torch.cuda.is_available()

True

# 2. Preprocess data

In [48]:
# Load data
from datasets import load_dataset
imdb = load_dataset("imdb")

# Optionally create a smaller dataset for faster training times
# The dataset has 25000 train and 25000 test cases. For faster iteration, select a subset of those
# (e.g. 3000 / 300); the values below use the full dataset.

train_count = 25000 #3000
test_count = 25000 #300

# Shuffle with a fixed seed for reproducibility, then keep the first N examples
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(train_count))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(test_count))

print(small_train_dataset[0])
print(small_test_dataset[0])

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1}
{'text': "<br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could go wrong?<br /><br />Very quickly, 

In [49]:
# Check details of the dataset
#small_test_dataset, small_test_dataset[0], small_test_dataset.description, small_test_dataset.info, dir(small_test_dataset),

In [50]:
# Set DistilBERT tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)


In [51]:
# Explore Tokenized Data

print(tokenized_test)
#print(tokenized_train[0])
print(tokenized_train[0].keys()) # instance is a dictionary
print(len(tokenized_train[0]["text"])) # length of the raw text in characters, not tokens; the token count is len(tokenized_train[0]["input_ids"])
#dir(tokenized_train)

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})
dict_keys(['text', 'label', 'input_ids', 'attention_mask'])
758
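
The `758` printed above is `len()` of the raw `text` string, so it counts characters rather than tokens. The sketch below separates the two notions using a toy whitespace tokenizer. This is an assumption made for illustration: DistilBERT actually uses WordPiece, which yields different counts. The 512 limit is the `model_max_length` reported for this tokenizer.

```python
MAX_LENGTH = 512  # model_max_length reported for the DistilBERT tokenizer


def toy_tokenize(text, max_length=MAX_LENGTH):
    """Whitespace tokenizer with truncation; a stand-in, NOT WordPiece."""
    return text.split()[:max_length]


review = "word " * 600  # hypothetical long review: 3000 characters, 600 words
tokens = toy_tokenize(review)

print(len(review))  # character count of the raw text
print(len(tokens))  # token count after truncation, capped at MAX_LENGTH
```

In the notebook itself, the token count of an example is `len(tokenized_train[0]["input_ids"])`.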


In [52]:
# Use a data collator to convert our samples to PyTorch tensors and pad each batch to the length of its longest sample
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [53]:
# Explore Collator
data_collator

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
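
To make the collator's role concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded on the right to the length of its longest sequence, and an attention mask marks real tokens versus padding. `pad_batch` is a hypothetical helper and the token ids are made up. The real `DataCollatorWithPadding` also returns tensors and handles labels.

```python
PAD_TOKEN_ID = 0  # pad_token_id of the DistilBERT config


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized reviews of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 102]])

print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which is why the tokenizer call above only truncates and does not pad.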

# 3. Training the model

In [54]:
# Define DistilBERT as our base model:
from transformers import AutoModelForSequenceClassification
# num_labels=2 attaches a 2-class classification head. The head is newly initialized, so it must be trained before use; more classes are possible but also need training.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# Second, untouched instance kept for before/after comparison (Trainer modifies `model` in place)
model_base = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [55]:
# Explore Model
print("model_base", model_base)
#dir(model)

# Evaluate the model before fine-tuning
import evaluate  # needed for evaluate.combine below
from evaluate import evaluator
task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(model_or_pipeline=model_base,
                                      data=small_test_dataset,
                                      metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
                                      label_mapping={"LABEL_0": 0, "LABEL_1": 1},
                                      tokenizer = tokenizer)

print("Evaluation Results", "\nACCURACY:", eval_results["accuracy"], "\nF1:", eval_results["f1"])

model_base DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inpla

# RESULTS ANALYSIS

## Train Sample Size = 3000, Test Sample Size = 300

### Evaluation Results

ACCURACY: 0.5166666666666667

F1: 0.39834024896265563

## Train Sample Size = 25000, Test Sample Size = 25000

### Evaluation Results

ACCURACY: 0.48936

F1: 0.1417238133656044
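
Accuracy near 0.5 combined with a much lower F1, as in the base-model results above, is the pattern expected when an untrained head predicts one class for almost every input. The sketch below computes the metrics from confusion counts. The counts are hypothetical and chosen only to reproduce that pattern on a balanced 25000-example test set; they are not the confusion matrix of this run.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# HYPOTHETICAL counts: the model labels almost everything as negative (class 0)
metrics = binary_metrics(tp=1000, fp=1000, fn=11500, tn=11500)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy stays at chance because half of the reviews really are negative, while recall on the positive class collapses and drags F1 down with it.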

In [56]:
# Define the evaluation metrics
import numpy as np
import evaluate  # datasets.load_metric is deprecated; evaluate.load is its replacement

# Load the metrics once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class = index of the largest logit
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

In [37]:
# Log in to your Hugging Face account
# Get your API token here https://huggingface.co/settings/token
#from huggingface_hub import notebook_login
#notebook_login()


In [57]:
# Define a new Trainer with all the objects we constructed so far
from transformers import TrainingArguments, Trainer

repo_name = "finetuning-sentiment-model-3000-samples"

training_args = TrainingArguments(output_dir=repo_name,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  num_train_epochs=2, #10
                                  weight_decay=0.01,
                                  save_strategy="epoch",
                                  push_to_hub=False,) # True)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=tokenized_train,
                  eval_dataset=tokenized_test,
                  tokenizer=tokenizer,
                  data_collator=data_collator,
                  compute_metrics=compute_metrics)

In [58]:
# Train the model
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.3311
1000,0.2346
1500,0.229
2000,0.1609
2500,0.1502
3000,0.1363


TrainOutput(global_step=3126, training_loss=0.20449344393387867, metrics={'train_runtime': 2447.0801, 'train_samples_per_second': 20.433, 'train_steps_per_second': 1.277, 'total_flos': 6555662533661952.0, 'train_loss': 0.20449344393387867, 'epoch': 2.0})
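
The `global_step` and throughput figures in the `TrainOutput` above can be checked with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these is stated explicitly in the notebook, but the numbers agree with them.

```python
import math

train_examples = 25000
batch_size = 16           # per_device_train_batch_size
epochs = 2                # num_train_epochs
train_runtime = 2447.0801  # seconds, from TrainOutput

# ASSUMPTION: one device, gradient_accumulation_steps=1, last partial batch kept
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Changing `num_train_epochs` to 10 under the same assumptions would give 15630 steps, which is a quick way to estimate runtime before launching a longer run.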

In [59]:
# Compute the evaluation metrics
trainer.evaluate()

{'eval_loss': 0.23920759558677673,
 'eval_accuracy': 0.93156,
 'eval_f1': 0.9316938799952095,
 'eval_runtime': 452.902,
 'eval_samples_per_second': 55.2,
 'eval_steps_per_second': 3.451,
 'epoch': 2.0}

# RESULTS ANALYSIS

## Train Sample Size = 3000, Test Sample Size = 300

### AFTER 2 EPOCHS:

'eval_loss': 0.30384764075279236,

'eval_accuracy': 0.8733333333333333,

'eval_f1': 0.8741721854304636,

'eval_runtime': 6.7095,

'eval_samples_per_second': 44.713,

'eval_steps_per_second': 2.832,

'epoch': 2.0

### AFTER 10 EPOCHS:

'eval_loss': 0.734824538230896,

'eval_accuracy': 0.88,

'eval_f1': 0.8846153846153846,

'eval_runtime': 5.9033,

'eval_samples_per_second': 50.819,

'eval_steps_per_second': 3.219,

'epoch': 10.0

## Train Sample Size = 25000, Test Sample Size = 25000

### AFTER 2 EPOCHS:

'eval_loss': 0.23920759558677673,

'eval_accuracy': 0.93156,

'eval_f1': 0.9316938799952095,

'eval_runtime': 452.902,

'eval_samples_per_second': 55.2,

'eval_steps_per_second': 3.451,

'epoch': 2.0


In [39]:
# Inspect the fine-tuned model held by the Trainer
trainer.model, #dir(trainer), #dir(Trainer),


(DistilBertForSequenceClassification(
   (distilbert): DistilBertModel(
     (embeddings): Embeddings(
       (word_embeddings): Embedding(30522, 768, padding_idx=0)
       (position_embeddings): Embedding(512, 768)
       (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
       (dropout): Dropout(p=0.1, inplace=False)
     )
     (transformer): Transformer(
       (layer): ModuleList(
         (0-5): 6 x TransformerBlock(
           (attention): MultiHeadSelfAttention(
             (dropout): Dropout(p=0.1, inplace=False)
             (q_lin): Linear(in_features=768, out_features=768, bias=True)
             (k_lin): Linear(in_features=768, out_features=768, bias=True)
             (v_lin): Linear(in_features=768, out_features=768, bias=True)
             (out_lin): Linear(in_features=768, out_features=768, bias=True)
           )
           (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
           (ffn): FFN(
             (dropout): Dropout(p=

# 4. Analyzing new data with the model

In [None]:
# Upload the model to the Hub
#trainer.push_to_hub()

Saving model checkpoint to finetuning-sentiment-model-3000-samples
Configuration saved in finetuning-sentiment-model-3000-samples/config.json
Model weights saved in finetuning-sentiment-model-3000-samples/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment-model-3000-samples/tokenizer_config.json
Special tokens file saved in finetuning-sentiment-model-3000-samples/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.



To https://huggingface.co/federicopascual/finetuning-sentiment-model-3000-samples
   389a6f9..b61d739  main -> main

To https://huggingface.co/federicopascual/finetuning-sentiment-model-3000-samples
   b61d739..11f7d32  main -> main



'https://huggingface.co/federicopascual/finetuning-sentiment-model-3000-samples/commit/b61d739ca657786166cace180d0da7f86e1f5bd2'

In [None]:
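# Run inference with a fine-tuned model using Pipeline
# Note: the model id below is the checkpoint published by the original blog post's author, not the model trained above.
# To use the locally fine-tuned model instead: pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)
from transformers import pipeline

sentiment_model = pipeline(model="federicopascual/finetuning-sentiment-model-3000-samples")

sentiment_model(["I love this move", "This movie sucks!"])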

Downloading:   0%|          | 0.00/2.05k [00:00<?, ?B/s]

loading configuration file https://huggingface.co/federicopascual/finetuning-sentiment-model-3000-samples/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6acaf454bd6ace68f425fac5d8bbfc119633e7e1f8f653b4e828a89a9ef13e75.7a44e794a6ac10a8d9a844cfcb5e74a5d94ae645ec8273f74ab0f0784099bc9b
Model config DistilBertConfig {
  "_name_or_path": "federicopascual/finetuning-sentiment-model-3000-samples",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "vocab_size": 30522
}

loading c

[{'label': 'LABEL_1', 'score': 0.9558863043785095},
 {'label': 'LABEL_0', 'score': 0.9413502216339111}]
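
The `score` values in the output above are the softmax probability of the predicted class, and `LABEL_0` / `LABEL_1` correspond to the IMDB labels 0 (negative) and 1 (positive). The sketch below reproduces that post-processing in plain Python. The logits are hypothetical, and `classify` and `LABEL_NAMES` are illustrative names, not part of `transformers`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# IMDB convention: label 0 = negative review, label 1 = positive review
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def classify(logits):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = f"LABEL_{best}"
    return {"label": label, "name": LABEL_NAMES[label], "score": probs[best]}


result = classify([-1.5, 1.6])  # HYPOTHETICAL logits for one review
print(result)
```

Readable labels can also be stored in the model itself by passing `id2label` and `label2id` to `from_pretrained`, so the pipeline reports them directly instead of `LABEL_0` / `LABEL_1`.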