<a href="https://colab.research.google.com/github/surajdusa/Assignment3-individual-project/blob/main/Untitled3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Introduction
This project uses Google Colab's GPU support to fine-tune a DistilBERT model for sentiment analysis on SST-2, a dataset in the GLUE benchmark. Setup consists of installing the transformers, datasets, and torch libraries, which handle model operations and data processing. The pipeline tokenizes the text with DistilBertTokenizer, batches it with DataLoaders, and trains the model with the AdamW optimizer. The goal is to classify movie-review sentences as positive or negative. The workflow aims for high accuracy at low computational cost, from environment setup through model training.

In [3]:
# @title Install Required Libraries
!pip install transformers datasets torch

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1

Methodology
This methodology prepares and processes the SST-2 dataset, a component of the GLUE benchmark chosen because it targets sentiment analysis directly. To keep processing fast and manageable, a subset of 1,000 training samples and 150 validation samples was used. The text was tokenized with DistilBertTokenizer, which converts sentences to a uniform format suitable for neural network processing by truncating and padding every sequence to a fixed length of 128 tokens.
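The fixed-length step can be illustrated without the tokenizer itself. The sketch below is a simplified stand-in, not the library's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are arbitrary illustrative values, and the pad id of 0 is an assumption based on the `padding_idx=0` shown in the model's embedding layer. The real tokenizer also preserves special tokens when it truncates, which this sketch does not attempt.

```python
# Hypothetical helper for illustration; not the tokenizer's real implementation.
PAD_ID = 0        # assumed pad token id (matches padding_idx=0 in the model printout)
MAX_LENGTH = 128  # fixed sequence length used in this project


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])            # drop tokens beyond max_length
    n_pad = max_length - len(kept)                 # how many pad slots are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut to 128 ids.
short_ids, short_mask = pad_or_truncate([101, 7592, 2088, 102])
long_ids, long_mask = pad_or_truncate(list(range(1, 201)))

print(len(short_ids), sum(short_mask))  # 128 4
print(len(long_ids), sum(long_mask))    # 128 128
```

The attention mask is what lets the model ignore the padded positions, which is why the notebook keeps both `input_ids` and `attention_mask` when it sets the PyTorch format.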

DataLoaders batch the data and shuffle the training set, which keeps data handling efficient and lets training use PyTorch's built-in batching utilities. The DistilBertForSequenceClassification model was initialized for the binary classification task, and AdamW was selected as the optimizer because of its reliable handling of weight updates, which is particularly helpful when fine-tuning deep learning models. With these technical steps simplified, more attention could go to model accuracy and training efficiency.

Load and Tokenize the Dataset

In [4]:
# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence'], truncation=True, padding="max_length", max_length=128)

# Load the SST-2 dataset and tokenize a small subset for quick experiments
dataset = load_dataset("glue", "sst2")
small_train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))  # Using only 1000 samples
small_valid_dataset = dataset['validation'].shuffle(seed=42).select(range(150))  # Using only 150 samples

# Apply tokenization
tokenized_train = small_train_dataset.map(tokenize_function, batched=True)
tokenized_valid = small_valid_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_valid.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])



tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

In [5]:
print("Training set size:", len(dataset['train']))
print("Validation set size:", len(dataset['validation']))
print("\nSample data from training set:")
print(dataset['train'][0])

# Output the format and columns
print("\nDataset format and columns:")
print(dataset['train'].features)

Training set size: 67349
Validation set size: 872

Sample data from training set:
{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}

Dataset format and columns:
{'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive'], id=None), 'idx': Value(dtype='int32', id=None)}
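The printed features show that `label` is a class index whose names are `negative` and `positive`, in that order. A minimal sketch of that lookup, with the sample record typed out by hand and `label_to_name` as a hypothetical helper:

```python
# Label names copied from the ClassLabel feature printed above.
LABEL_NAMES = ["negative", "positive"]


def label_to_name(label_id):
    """Map an integer SST-2 label to its class name."""
    return LABEL_NAMES[label_id]


# The first training example shown above, written out by hand.
sample = {
    "sentence": "hide new secretions from the parental units ",
    "label": 0,
    "idx": 0,
}

print(label_to_name(sample["label"]))  # negative
```

With the datasets library loaded, `dataset['train'].features['label'].int2str(0)` should perform the same lookup directly on the `ClassLabel` feature.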


In [6]:
# Define batch size
batch_size = 16

# Create DataLoaders
train_dataloader = DataLoader(tokenized_train, batch_size=batch_size, shuffle=True)
validation_dataloader = DataLoader(tokenized_valid, batch_size=batch_size, shuffle=False)
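To make the batching concrete, the sketch below mimics how a loader groups sample indices for the subset sizes used here. It is a plain-Python approximation under stated assumptions: `make_batches` is a hypothetical helper, and it keeps the final partial batch, which matches the DataLoader default of `drop_last=False`.

```python
import random


def make_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, roughly as a DataLoader groups them."""
    indices = list(range(n_samples))
    if shuffle:
        # A real loader reshuffles every epoch; one fixed shuffle is enough here.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


train_batches = list(make_batches(1000, 16, shuffle=True))
valid_batches = list(make_batches(150, 16, shuffle=False))

# Number of batches per epoch and the size of the last (partial) batch.
print(len(train_batches), len(train_batches[-1]))  # 63 8
print(len(valid_batches), len(valid_batches[-1]))  # 10 6
```

One training epoch therefore takes 63 optimizer steps, and a validation pass takes 10 forward passes. The validation order stays fixed because shuffling affects only how gradients are sampled, not how a metric is computed.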

In [7]:
# Initialize the model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Move model to the device


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
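Because the model is created with `num_labels=2`, its classification head returns two raw scores (logits) per sentence, one for each class. The sketch below shows one common way to turn those scores into a prediction. The logits are made-up values and `predict` is a hypothetical helper. Until the newly initialized classifier weights are trained, real outputs carry no meaningful signal.

```python
import math

LABEL_NAMES = ["negative", "positive"]  # order taken from the dataset's ClassLabel


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Made-up logits for one sentence; real values come from the model's output.
name, prob = predict([-1.2, 2.3])
print(name, round(prob, 3))  # positive 0.971
```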
 

In [8]:
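# transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW

# Initialize the optimizer (torch.optim.AdamW defaults to weight_decay=0.01)
optimizer = AdamW(model.parameters(), lr=5e-5)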
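For intuition about what the optimizer does on each step, here is a minimal sketch of the AdamW update for a single scalar parameter. It is an illustration, not the library code: `adamw_step` is a hypothetical helper, the learning rate matches the notebook, and the remaining hyperparameters are assumed to be the usual PyTorch defaults. The toy loss `w**2` is also an assumption chosen to keep the gradient simple.

```python
import math


def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter at step t (t starts at 1)."""
    param = param * (1.0 - lr * weight_decay)    # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Three updates on the toy loss w**2, whose gradient is 2*w.
w, m, v = 1.0, 0.0, 0.0
for step in range(1, 4):
    grad = 2.0 * w
    w, m, v = adamw_step(w, grad, m, v, step)

print(w)  # slightly below 1.0: each step moves w by roughly lr
```

The weight decay shrinks the parameter directly instead of being added to the gradient, which is the feature that separates AdamW from Adam with L2 regularization.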

