# IE7500 BERT Sentiment Analysis Model Generation

### Install statements if needed.
##### Note: transformers[torch] installation requires a kernel restart

In [1]:
#!pip install evaluate
#!pip install transformers[torch]
#!pip install ipywidgets

In [2]:
import warnings
warnings.filterwarnings("ignore")

### Import the Hugging Face "imdb" dataset and set up the training and test splits

In [3]:
from datasets import load_dataset
imdb = load_dataset("imdb")

In [4]:
train_dataset = imdb["train"]
test_dataset = imdb["test"]

In [5]:
train_dataset[10:12]

{'text': ['It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn\'t go on to star in more and better films. Sadly, I didn\'t think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat\'s Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is 

### The DistilBERT tokenizer uses WordPiece tokenization to break words into sub-word units.
#### This helps the DistilBERT transformer handle out-of-vocabulary words.
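
To make the sub-word idea concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. It is an illustration only, not the tokenizer's actual implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is a made-up toy set rather than the real `distilbert-base-uncased` vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption), not the real DistilBERT vocabulary.
toy_vocab = {"un", "##believ", "##ably", "film", "##s"}

print(wordpiece_tokenize("unbelievably", toy_vocab))
print(wordpiece_tokenize("films", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word missing from the vocabulary as a whole, such as "unbelievably" here, is still represented by known pieces; only a word with no matching pieces at all falls back to the unknown token.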

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [7]:
def preprocess_function(examples):
   return tokenizer(examples["text"], truncation=True)
 
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)

### The data collator pads each batch to its longest sequence for model training
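
The batching step can be sketched in plain Python as dynamic padding: every sequence in a batch is padded to the length of the longest one, and an attention mask marks which positions are real tokens. This is a simplified illustration, assuming a pad token id of 0 and using made-up token ids; `pad_batch` is a hypothetical helper, and the real `DataCollatorWithPadding` returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the longest one and build attention masks."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        # Real tokens get mask 1, padding positions get mask 0.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Illustrative token ids only (assumption), not taken from the dataset.
features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 3185, 2001, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global length keeps batches of short reviews small, which is why the tokenization cell only truncates and leaves padding to the collator.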

In [8]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


### Build the DistilBERT model

In [9]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Create a function to evaluate the model's performance

In [10]:
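import numpy as np
import evaluate

# Load the metrics once, rather than on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
   # eval_pred unpacks into the model's logits and the reference labels.
   logits, labels = eval_pred
   # The predicted class is the index of the largest logit in each row.
   predictions = np.argmax(logits, axis=-1)
   accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
   f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
   return {"accuracy": accuracy, "f1": f1}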
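
For readers who want to see what the two metrics measure, the following is a small dependency-free sketch of the same calculation: take the argmax of each row of logits, then compute accuracy and binary F1 with label 1 as the positive class. The helper names and the logits and labels are invented for illustration; the notebook itself relies on the `evaluate` library.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(predictions, labels, positive=1):
    """Accuracy and binary F1, treating `positive` as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up logits and labels (assumption), one row per example.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
labels = [0, 1, 0, 0]

predictions = argmax_rows(logits)
accuracy, f1 = accuracy_and_f1(predictions, labels)
print(predictions, accuracy, round(f1, 4))
```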

In [11]:
from huggingface_hub import login
HF_TOKEN = ""  # Hugging Face access token (left blank here); push_to_hub=True needs one with write access
login(token=HF_TOKEN)

### Set up training parameters

In [12]:
from transformers import TrainingArguments, Trainer
 
repo_name = "DistilBert_model_Sentiment_Analysis"
 
training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=4,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=True,
)

In [13]:
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

### Model Training

In [14]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.2532        |
| 1000 | 0.1354        |
| 1500 | 0.085         |


TrainOutput(global_step=1564, training_loss=0.15441182202390394, metrics={'train_runtime': 926.104, 'train_samples_per_second': 107.979, 'train_steps_per_second': 1.689, 'total_flos': 1.32466881205224e+16, 'train_loss': 0.15441182202390394, 'epoch': 4.0})
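
The reported `global_step=1564` can be checked with a little arithmetic. The sketch below assumes the standard IMDB training split of 25,000 reviews, no gradient accumulation, and a final partial batch that is kept. The number of devices is not stated in this notebook; four is an inference from the logged throughput (about 108 samples per second at about 1.69 steps per second is roughly 64 samples per step, or 4 x 16), so treat it as an assumption. `total_steps` is a hypothetical helper.

```python
import math


def total_steps(n_samples, per_device_batch, n_devices, epochs):
    """Optimizer steps for a run, assuming no gradient accumulation."""
    effective_batch = per_device_batch * n_devices
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch * epochs


# Assumed: 25,000 training reviews and 4 devices (the device count is inferred, not stated).
print(total_steps(25_000, per_device_batch=16, n_devices=4, epochs=4))
# For comparison, the same settings on a single device.
print(total_steps(25_000, per_device_batch=16, n_devices=1, epochs=4))
```

Under these assumptions the four-device figure matches the logged step count, which also explains why only three loss values (at steps 500, 1000, and 1500) were logged.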

### Model Evaluation

In [15]:
trainer.evaluate()

{'eval_loss': 0.23308244347572327,
 'eval_accuracy': 0.93084,
 'eval_f1': 0.9307818567596782,
 'eval_runtime': 83.3193,
 'eval_samples_per_second': 300.05,
 'eval_steps_per_second': 4.693,
 'epoch': 4.0}