# Full PyTorch Sentiment Analysis Example

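Select the `llm-env` kernel before running the notebook.

**Note**
Install the Jupyter dependencies if needed before using this notebook.
Install PyTorch by following the instructions on the official website, choosing either the CPU or the GPU (CUDA) build.
The code below also requires the `datasets`, `transformers`, and `scikit-learn` packages.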

In [None]:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

## Step 1: Import Libraries and Load Dataset

In [7]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from sklearn.metrics import classification_report

# Load IMDb dataset
dataset = load_dataset("imdb")

# Check data structure
print(dataset)
print(dataset['train'][0])  # A sample from the training data


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 342619.27 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 418538.08 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 415787.27 examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and




## Step 2: Tokenization

In [8]:
# Load pre-trained tokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Specify PyTorch format
tokenized_dataset = tokenized_dataset.with_format("torch")
print(tokenized_dataset["train"].features)


Map: 100%|██████████| 25000/25000 [00:07<00:00, 3333.85 examples/s]
Map: 100%|██████████| 25000/25000 [00:08<00:00, 3063.48 examples/s]
Map: 100%|██████████| 50000/50000 [00:15<00:00, 3280.98 examples/s]

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
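
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. The helper `pad_or_truncate` and the toy IDs are hypothetical illustrations, not the tokenizer's actual implementation; the real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]` when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=0 is an assumption
    that matches the padding_idx=0 shown in the model printout in Step 4.
    """
    kept = token_ids[:max_length]              # truncation: drop the overflow
    n_pad = max_length - len(kept)             # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is truncated.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `bert-base-uncased` the maximum length is 512, which is why every batch in Step 3 has a sequence dimension of 512. The attention mask is what lets the model ignore the padded positions.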





## Step 3: Prepare DataLoaders

In [9]:
# Set the batch size according to your available memory (reduce it if you run out of GPU memory)
batch_size = 16

# Create DataLoader
train_dataloader = DataLoader(tokenized_dataset["train"], batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(tokenized_dataset["test"], batch_size=batch_size)

# Check sample
for batch in train_dataloader:
    print(batch["input_ids"].shape)  # (batch_size, sequence_length)
    break

torch.Size([16, 512])
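
As a rough guide to how long training will take, the number of batches per epoch follows directly from the dataset size and the batch size. The sketch below is plain arithmetic and assumes the `DataLoader` default of keeping the final partial batch (`drop_last=False`); the variable names are illustrative.

```python
import math

num_train_examples = 25_000  # size of the IMDb train split shown in Step 1
batch_size = 16              # batch size chosen in this step
epochs = 3                   # epoch count used in the training loop in Step 6

# With drop_last=False, the last batch may be smaller than batch_size.
batches_per_epoch = math.ceil(num_train_examples / batch_size)
last_batch_size = num_train_examples - (batches_per_epoch - 1) * batch_size
total_optimizer_steps = batches_per_epoch * epochs

print(f"batches per epoch:     {batches_per_epoch}")
print(f"size of final batch:   {last_batch_size}")
print(f"total optimizer steps: {total_optimizer_steps}")
```

Halving the batch size doubles the number of steps per epoch, so it trades memory for wall-clock time.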


## Step 4: Define the Model

In [10]:
# Load pre-trained BERT model for classification
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

# Check model architecture
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## Step 5: Define Optimizer and Loss

In [11]:
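# Optimizer. No separate loss function is defined here: the model computes
# the cross-entropy loss internally when `labels` are passed to it (see Step 6).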
optimizer = AdamW(model.parameters(), lr=5e-5)

# Move model to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)





## Step 6: Training Loop

In [12]:
# Training loop
epochs = 3

for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
        
        # Move data to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        
        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        
        # Backward pass
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}")


Epoch 1/3 - Loss: 0.2588
Epoch 2/3 - Loss: 0.1372
Epoch 3/3 - Loss: 0.0860


## Step 7: Evaluation

In [13]:
# Evaluation function
def evaluate_model(model, dataloader):
    model.eval()
    predictions, true_labels = [], []
    
    with torch.no_grad():
        for batch in dataloader:
            # Move data to device
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)
            
            # Get predictions
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1)
            
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())
    
    return predictions, true_labels

# Evaluate on the test set
test_preds, test_labels = evaluate_model(model, test_dataloader)
print(classification_report(test_labels, test_preds, target_names=["Negative", "Positive"]))


              precision    recall  f1-score   support

    Negative       0.92      0.93      0.93     12500
    Positive       0.93      0.92      0.93     12500

    accuracy                           0.93     25000
   macro avg       0.93      0.93      0.93     25000
weighted avg       0.93      0.93      0.93     25000
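
In `evaluate_model`, `torch.argmax(logits, dim=-1)` turns the two raw scores per review into a class index. The following pure-Python sketch shows the same idea on made-up logits, with a softmax added only to show the corresponding probabilities. The function names and the example values are hypothetical; they are not taken from the model's output.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Index of the largest logit, i.e. argmax over the class dimension."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for two reviews; index 0 = negative, index 1 = positive.
example_logits = [[2.0, -1.0], [-0.5, 1.5]]
predicted_classes = [predict(row) for row in example_logits]

for row, cls in zip(example_logits, predicted_classes):
    probs = [round(p, 3) for p in softmax(row)]
    print(f"logits={row} probs={probs} predicted class={cls}")
```

Softmax does not change which class has the largest score, so taking the argmax of the logits directly gives the same prediction and skips the extra computation.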



## Optionally save the model

In [None]:
# Save model and tokenizer
# model.save_pretrained("sentiment-analysis-bert")
# tokenizer.save_pretrained("sentiment-analysis-bert")

# Load model back
# loaded_model = AutoModelForSequenceClassification.from_pretrained("sentiment-analysis-bert")


**Key Insights on Evaluation Metrics**

`Accuracy`: The ratio of correctly predicted instances to total instances. Most informative when the classes are balanced, as in the IMDb test set (12,500 reviews per class).

`Precision`: Of the instances predicted positive, the fraction that are actually positive. Important when false positives are costly.

`Recall`: Of the actual positives, the fraction that are predicted positive. Important when false negatives are costly.

`F1 Score`: The harmonic mean of precision and recall. It balances false positives against false negatives and is more informative than accuracy on imbalanced datasets.
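
The four definitions above can be written out directly in terms of true/false positives and negatives. This is a minimal sketch with made-up counts; `binary_metrics` is a hypothetical helper, and the numbers are not the results of this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

`classification_report` reports precision, recall, and F1 once per class, treating each class in turn as the "positive" one, which is why the table in Step 7 has a row for both `Negative` and `Positive`.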

**Extensions**

1. Fine-Tuning: Experiment with different numbers of epochs, learning rates, and batch sizes.

2. Advanced Evaluation: Add a confusion matrix or ROC curve for visual analysis.

3. Other Models: Try other pre-trained models like `distilbert-base-uncased` for faster training.

4. Custom Data: Replace the IMDb dataset with your own text dataset.
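
As a starting point for the confusion matrix mentioned in extension 2, here is a minimal pure-Python sketch that counts (true label, predicted label) pairs. The function `confusion_matrix_2x2` and the toy labels are hypothetical; in practice `sklearn.metrics.confusion_matrix` does the same job and could be applied to `test_labels` and `test_preds` from Step 7.

```python
from collections import Counter


def confusion_matrix_2x2(true_labels, predictions):
    """Rows are true classes, columns are predicted classes (0 = neg, 1 = pos)."""
    # int() makes the keys plain Python ints even if the inputs are NumPy scalars.
    counts = Counter((int(t), int(p)) for t, p in zip(true_labels, predictions))
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]


# Toy labels for illustration only.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
matrix = confusion_matrix_2x2(toy_true, toy_pred)

print("            pred neg  pred pos")
print(f"true neg    {matrix[0][0]:>8}  {matrix[0][1]:>8}")
print(f"true pos    {matrix[1][0]:>8}  {matrix[1][1]:>8}")
```

The off-diagonal cells show which kind of error dominates: the top-right cell counts false positives and the bottom-left cell counts false negatives.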