---
<span style="color:#000; font-family: 'Bebas Neue'; font-size: 2em;">BIG DATA</span>

<span style="color:#f00; font-family: 'Bebas Neue'; font-size: 1.5em;">Unit 4: Introduction to Natural Language Processing</span>

<span style="color:#300; font-family: 'Bebas Neue'; font-size: 1.5em;">4.2 Training Models</span>
<h4 style="color:darkblue"> Universidad de Deusto</h4>

<span style="color:#300; font-family: 'Bebas Neue'; font-size: 1em;">m.varo@deusto.es</span>

<h5 style="color:black">  April 8, 2025 - Donostia </h5>

---

**BERT**, short for Bidirectional Encoder Representations from Transformers, is a powerful natural language processing (NLP) model developed by Google that uses a deep neural network architecture based on the state-of-the-art transformer model.

As noted above, the **BERT model architecture is based on a deep neural network called a transformer**. Unlike earlier recurrent NLP models, which process text one token at a time, a transformer processes the entire input sequence at once, which helps it capture the relationships between words and phrases more effectively.

**How does the BERT model work for text classification?**

BERT uses a multi-layer bidirectional transformer encoder to represent the input text in a high-dimensional space. That means **it can take into account the entire context of each word in the sentence, which helps it to better understand the meaning of the text.**
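
To make "the entire context of each word" concrete, here is a minimal plain-Python sketch of scaled dot-product self-attention. It is a simplification and not BERT's actual implementation: the two-dimensional embeddings are made up, queries, keys, and values are all the same vectors, and the learned projections and multiple heads of a real transformer are left out.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this token against every token in the sentence, to its left and right.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted mix of all token vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-token sentence.
tokens = ["movie", "was", "great"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Every row of weights is non-zero for all three tokens, so each output vector depends on the words both before and after it. That is the sense in which the encoder is bidirectional.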

One of the most interesting things about **BERT is that it’s a pre-trained model**. This means that BERT is first trained on massive amounts of text data, such as books, articles, and websites, **before it’s fine-tuned for specific downstream NLP tasks, including text classification.**

By pre-training on a large corpus of text data, BERT can develop a deep understanding of the underlying structure and meaning of language, making it a highly effective tool for NLP tasks. Once pre-trained, BERT can be fine-tuned for specific tasks, which allows it to adapt to the specific nuances of the task and improve its accuracy.

In [1]:
!pip install transformers



In [2]:
import nltk
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
from torch.optim import AdamW


In [26]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\m.varo\.cache\kagglehub\datasets\lakshmi25npathi\imdb-dataset-of-50k-movie-reviews\versions\1


We then pass the location of the downloaded CSV file to the load_imdb_data() function, which returns the review texts and their labels (1 for positive, 0 for negative).

In [3]:
def load_imdb_data(data_file):
    df = pd.read_csv(data_file)
    texts = df['review'].tolist()
    labels = [1 if sentiment == "positive" else 0 for sentiment in df['sentiment'].tolist()]
    return texts, labels

In [4]:
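# Build the CSV path from the directory returned by kagglehub.dataset_download() above.
data_file = os.path.join(path, "IMDB Dataset.csv")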
texts, labels = load_imdb_data(data_file)
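
Because load_imdb_data() maps anything other than "positive" to 0, a misspelled or missing sentiment would silently become a negative label. The sketch below applies the same rule to a made-up three-row sample and counts the labels, which is a quick sanity check worth running on the real data too. The sample rows are hypothetical, and only the standard library is used so that it runs without the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the same two-column layout as the IMDB CSV.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A moving, well-acted film.",positive\n'
    '"Dull and far too long.",negative\n'
    '"I would watch it again.",positive\n'
)

rows = list(csv.DictReader(sample_csv))
sample_texts = [row["review"] for row in rows]
# Same rule as load_imdb_data(): "positive" -> 1, anything else -> 0.
sample_labels = [1 if row["sentiment"] == "positive" else 0 for row in rows]

label_counts = Counter(sample_labels)
print(label_counts)
```

On the full dataset, `Counter(labels)` should show two classes of similar size. A heavily skewed count would suggest a problem in the sentiment column.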

**Create a custom dataset class for text classification**

This is a custom dataset class that helps organize movie reviews and their sentiments for our BERT model. It takes care of tokenizing the text, handling the sequence length, and providing a neat package with input IDs, attention masks, and labels for our model to learn from.


In [5]:
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize one review, padding or truncating it to exactly max_length tokens.
        encoding = self.tokenizer(
            text, return_tensors='pt', max_length=self.max_length,
            padding='max_length', truncation=True,
        )
        # flatten() drops the batch dimension of size 1 added by return_tensors='pt'.
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label),
        }
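
The padding and truncation that the tokenizer performs can be sketched in plain Python. This is a simplified illustration, not the tokenizer's code: the helper name `pad_or_truncate` and the middle token IDs are made up, and the real tokenizer also keeps the closing [SEP] token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill the remainder
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the IDs in between are made up.
short_review = [101, 2307, 3185, 102]
ids, mask = pad_or_truncate(short_review, max_length=6)
print(ids)
print(mask)

long_review = [101] + list(range(1000, 1010)) + [102]
long_ids, long_mask = pad_or_truncate(long_review, max_length=6)
print(long_ids)
print(long_mask)
```

The attention mask is what lets the model ignore the padding positions, so every review in a batch can share the same length without the padding affecting the result.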

**Build our custom BERT classifier**

The classifier is built on top of the pre-trained BERT model, which is very good at understanding text.
We add a dropout layer to reduce overfitting and a linear layer that maps BERT's pooled output to class scores.
Our BERTClassifier takes input IDs and attention masks, runs them through BERT and then through the two extra layers, and returns one raw score (logit) per class.

In [6]:
class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits
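
To see what the two extra layers do to tensor shapes without downloading BERT, the sketch below feeds a random stand-in for `outputs.pooler_output` through the same dropout and linear layers. The random tensor and the name `demo_batch_size` are assumptions for illustration. 768 is the hidden size of bert-base-uncased.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768      # hidden size of bert-base-uncased
num_classes = 2
demo_batch_size = 4    # hypothetical number of reviews in one batch

# Random stand-in for outputs.pooler_output, so no BERT weights are needed.
fake_pooled_output = torch.randn(demo_batch_size, hidden_size)

head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_classes))
head.eval()  # disable dropout, as model.eval() does at inference time

with torch.no_grad():
    logits = head(fake_pooled_output)

print(logits.shape)  # one row per review, one column per class
```

Each row holds two unnormalized scores. The loss function and the prediction code both work directly on these logits.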

**Train Function**

The train() function takes the model, data loader, optimizer, scheduler, and device as its arguments. The function puts the model into training mode and then runs through each batch of data from the data loader. For each batch, it clears the optimizer’s gradients, gets the input IDs, attention masks, and labels, and feeds the inputs to the model. It then computes the cross-entropy loss between the model's scores and the labels, backpropagates it, and updates the weights and the learning rate.

In [7]:
def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
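
The loss used in train() can be worked out by hand. The sketch below is a plain-Python version of the quantity that nn.CrossEntropyLoss computes with its default settings: the batch mean of minus the log-probability assigned to the true class. The logits are made up for illustration.

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean over the batch of -log(softmax(logits)[true_label])."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[label])
    return sum(losses) / len(losses)

# Hypothetical logits for one review: [score_negative, score_positive].
confident_and_right = cross_entropy([[-2.0, 3.0]], [1])
confident_and_wrong = cross_entropy([[-2.0, 3.0]], [0])
undecided = cross_entropy([[0.0, 0.0]], [1])
print(confident_and_right, confident_and_wrong, undecided)
```

A confident correct prediction gives a loss close to zero, an undecided one gives log(2), and a confident wrong one is penalized heavily. That gradient signal is what pushes the weights during fine-tuning.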

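**Evaluation Method**

The evaluate() function puts the model into evaluation mode and disables gradient tracking. For each batch, it gets the input IDs, attention masks, and labels, feeds the inputs to the model, and takes the class with the highest score as the prediction. Finally, it compares the predictions with the actual labels and returns the accuracy score and a classification report.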

In [8]:
def evaluate(model, data_loader, device):
    model.eval()
    predictions = []
    actual_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)
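
The numbers returned by evaluate() can be reproduced by hand on a small example. The sketch below uses made-up labels and predictions for six reviews and computes accuracy, plus the precision and recall of the positive class, which are among the figures that classification_report prints.

```python
from collections import Counter

# Hypothetical labels and predictions for six reviews (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

# Count each (actual, predicted) pair; this is the confusion matrix in dict form.
confusion = Counter(zip(actual, predicted))
tp, tn = confusion[(1, 1)], confusion[(0, 0)]
fp, fn = confusion[(0, 1)], confusion[(1, 0)]

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)   # of the reviews predicted positive, the share that were positive
recall = tp / (tp + fn)      # of the truly positive reviews, the share that were found
print(accuracy, precision, recall)
```

Here four of six predictions are correct, so the accuracy is 2/3. Precision and recall are also 2/3 because there is one false positive and one false negative.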

**Prediction Method**

The predict_sentiment() function classifies a single piece of text. It tokenizes the text the same way as the dataset class, feeds the input IDs and attention mask to the model with gradient tracking disabled, and returns "positive" if class 1 has the higher score and "negative" otherwise.

In [9]:
def predict_sentiment(text, model, tokenizer, device, max_length=128):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt', max_length=max_length, padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return "positive" if preds.item() == 1 else "negative"
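
predict_sentiment() returns only a label. If a confidence value is also wanted, one option is to apply a softmax to the logits. The sketch below shows the arithmetic in plain Python on made-up logits. It is an illustration of the idea, not a change to the function above.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review: [score_negative, score_positive].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted_class = probs.index(max(probs))   # same class as the argmax over the logits
label = "positive" if predicted_class == 1 else "negative"
print(label, round(probs[predicted_class], 3))
```

Softmax does not change which class wins, so the predicted label is the same as the one obtained from the raw scores. These probabilities are often overconfident and should be read as a rough indication only.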

**Define model parameters**

Here, we are going to set up essential parameters for fine-tuning the BERTClassifier, including the BERT model name, number of classes, maximum input sequence length, batch size, number of training epochs, and learning rate, to help the model effectively understand movie reviews and their sentiments.

In [10]:
# Set up parameters
bert_model_name = 'bert-base-uncased'
num_classes = 2
max_length = 128
batch_size = 16
num_epochs = 4
learning_rate = 2e-5

**Loading and splitting the data**

In [11]:
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)


**Initialise tokenizer, dataset and dataloader**

In [12]:
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer, max_length)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
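
With these settings it is easy to work out how many batches each data loader yields, assuming the full 50,000-review dataset is used. The variable names below are chosen for this sketch only.

```python
import math

n_reviews = 50_000      # size of the IMDB dataset downloaded above
test_fraction = 0.2     # same value as test_size in train_test_split
per_batch = 16
epochs = 4

n_val = round(n_reviews * test_fraction)
n_train = n_reviews - n_val

# A DataLoader without drop_last yields ceil(n / batch_size) batches.
train_batches = math.ceil(n_train / per_batch)
val_batches = math.ceil(n_val / per_batch)
expected_total_steps = train_batches * epochs
print(n_train, n_val, train_batches, val_batches, expected_total_steps)
```

That is 2,500 training batches per epoch and 10,000 optimizer steps over four epochs. On a CPU this can take a long time, so a smaller subset or fewer epochs may be more practical for a first run.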

**Set up the device and model**

In [None]:
from transformers.modeling_utils import init_empty_weights

In [17]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Using device:', device)
model = BERTClassifier(bert_model_name, num_classes).to(device)

Using device: cpu


In [15]:
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
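
The shape of a linear schedule with warmup can be sketched in plain Python: the learning rate rises linearly from zero during the warmup steps and then falls linearly to zero at the last training step. The function below is our own illustration of that shape, not the library's code. The step counts assume 2,500 batches per epoch for 4 epochs.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 towards base_lr.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay: ramp down from base_lr to 0 at the final step.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

# Assumed values: 10,000 training steps, no warmup, base learning rate 2e-5.
schedule_steps = 10_000
for step in (0, 2_500, 5_000, 10_000):
    print(step, linear_schedule_lr(step, 2e-5, 0, schedule_steps))
```

With `num_warmup_steps=0`, as in the cell above, training starts at the full learning rate and decays from the first step. A small warmup, such as a few hundred steps, is a common alternative.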

**Training the model**

In [16]:
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

Epoch 1/4


KeyboardInterrupt: 

**Save the model**

In [None]:
torch.save(model.state_dict(), "bert_classifier.pth")
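
Since only the state dict is saved, loading it later means rebuilding the same architecture first and then copying the weights in. The sketch below shows that round trip with a small stand-in model so it runs without BERT. The model, file name, and temporary directory are assumptions for illustration. With the real classifier, the model would be rebuilt as `BERTClassifier(bert_model_name, num_classes)` before calling `load_state_dict`.

```python
import os
import tempfile

import torch
from torch import nn

# Small stand-in model so the example runs without downloading BERT.
toy_model = nn.Linear(8, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, "toy_classifier.pth")
    torch.save(toy_model.state_dict(), checkpoint_path)

    # Rebuild the same architecture, then copy the saved weights into it.
    restored_model = nn.Linear(8, 2)
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    restored_model.load_state_dict(state_dict)

restored_model.eval()  # switch to inference mode before predicting
weights_match = torch.equal(toy_model.weight, restored_model.weight)
print(weights_match)
```

`map_location` lets a checkpoint saved on a GPU be loaded on a machine that only has a CPU.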

The BERTClassifier does not modify the internal structure of BERT itself. Instead, it follows the standard fine-tuning procedure for BERT:

- BERT's pre-trained layers are kept intact (i.e., the architecture stays the same).

- A classification head (here, a dropout layer followed by a single linear layer) is added on top of BERT.

- The entire model (BERT + classification head) is then fine-tuned on the specific text classification task (here, sentiment analysis).

This means:

- The BERT base model weights are adjusted slightly during training to adapt to the new task.

- The overall architecture isn't changed: no layers are added or removed inside BERT itself.

- The new layer learns to use BERT's contextual embeddings to make task-specific predictions.

So in short: no structural change to BERT, just a new classification layer and fine-tuning of the full model on your dataset.
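
A quick count shows how small the added layer is. The arithmetic below assumes the hidden size of bert-base-uncased (768) and the two classes used in this notebook. The dropout layer has no parameters, so only the linear layer counts.

```python
hidden_size = 768   # hidden size of bert-base-uncased
num_classes = 2

# nn.Linear(hidden_size, num_classes) stores a weight matrix and a bias vector.
weight_params = hidden_size * num_classes
bias_params = num_classes
head_params = weight_params + bias_params
print(head_params)
```

The head adds 1,538 parameters, while BERT base has on the order of 110 million. Almost all of the capacity being fine-tuned is therefore in the pre-trained encoder.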



In [None]:
# Test sentiment prediction
test_text = "The movie was great and I really enjoyed the performances of the actors."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")

In [1]:
# Test sentiment prediction
test_text = "The movie was so bad and I would not recommend it to anyone."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")


In [None]:
# Test sentiment prediction
test_text = "Worst movie of the year."
sentiment = predict_sentiment(test_text, model, tokenizer, device)
print(test_text)
print(f"Predicted sentiment: {sentiment}")