**Install and Import the Necessary Libraries**  
   We’ll need to install the following (if you haven’t already):  
   - `datasets` (Hugging Face’s library for datasets)  
   - `transformers` (Hugging Face’s library for models, tokenizers, etc.)  
   - `torch` (PyTorch)  
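
If you are not sure whether these packages are already present, a quick check along the following lines can help. This is only a minimal sketch using the standard library; the helper name `check_packages` is made up for this example. Anything reported as missing can typically be installed with `pip install datasets transformers torch`.

```python
from importlib import metadata


def check_packages(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the status of the three libraries used in this notebook.
status = check_packages(["datasets", "transformers", "torch"])
for name, version in status.items():
    print(f"{name}: {version or 'not installed'}")
```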


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

**Load a Dataset using Hugging Face `datasets`**  
The `datasets` library provides a convenient `load_dataset` function to load many popular NLP datasets.  
- For example, let’s load the [IMDb dataset](https://huggingface.co/datasets/imdb), which is a sentiment classification dataset.


In [3]:
dataset = load_dataset("imdb")  # Returns a DatasetDict with 'train', 'test', and 'unsupervised' splits
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


**Choose and Load a Model & Tokenizer**  
   With Hugging Face Transformers, you typically load a pretrained model together with its matching tokenizer. For example, we can load [BERT-base-uncased](https://huggingface.co/bert-base-uncased):


In [4]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

**Preprocess the Dataset with the Tokenizer**  
   You must tokenize the raw texts so that they become valid model inputs.  
   - We typically apply the tokenizer on each text sample.  
   - The tokenizer returns a dictionary with `input_ids` and `attention_mask`; for BERT it also includes `token_type_ids`.  
   - We will also keep the labels.
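
To make the roles of `input_ids` and `attention_mask` concrete, here is a toy sketch of padding and truncation to a fixed length. It is not BERT's WordPiece tokenizer: the function `toy_encode`, the tiny vocabulary, and the special-token ids are all invented for illustration. It only mimics the effect of `padding="max_length"` and `truncation=True`.

```python
def toy_encode(text, vocab, max_length=8, pad_id=0, unk_id=1, cls_id=2, sep_id=3):
    """Toy whitespace encoder that pads or truncates to exactly max_length ids."""
    # Reserve two positions for the [CLS]-like and [SEP]-like special tokens.
    words = text.lower().split()[: max_length - 2]
    ids = [cls_id] + [vocab.get(word, unk_id) for word in words] + [sep_id]
    # Real tokens get mask 1; padded positions get mask 0 so the model ignores them.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


vocab = {"a": 4, "great": 5, "movie": 6}  # hypothetical vocabulary
short = toy_encode("A great movie", vocab)         # shorter than max_length: padded
long = toy_encode("a great movie " * 5, vocab)     # longer than max_length: truncated
print(short)
print(long)
```

Both results have the same length, which is what lets the examples be stacked into a single batch tensor later on.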

In [6]:
def tokenize_function(examples):
    # Pad or truncate each review to exactly 128 tokens ([CLS] and [SEP] included).
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128, add_special_tokens=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

**Convert Dataset Splits to PyTorch-friendly Format**  
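   A Hugging Face `Dataset` can return PyTorch tensors directly via `set_format`; alternatively, you can wrap the data in a custom `Dataset` subclass. A common approach is to check which columns tokenization added, then call `set_format` on the ones the model needs: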

In [7]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [8]:
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
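
The other option mentioned earlier, a custom `Dataset` subclass, might look roughly like the sketch below. The class name `EncodedTextDataset` and the toy ids are assumptions made for this example; it expects already-tokenized, equal-length sequences and is not needed when `set_format` is used.

```python
import torch
from torch.utils.data import Dataset


class EncodedTextDataset(Dataset):
    """Wraps parallel lists of encoded examples and returns one dict of tensors per item."""

    def __init__(self, input_ids, attention_mask, labels):
        if not (len(input_ids) == len(attention_mask) == len(labels)):
            raise ValueError("all inputs must contain the same number of examples")
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert to tensors lazily, one example at a time.
        return {
            "input_ids": torch.tensor(self.input_ids[idx], dtype=torch.long),
            "attention_mask": torch.tensor(self.attention_mask[idx], dtype=torch.long),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }


# Two hand-written, already padded examples (ids chosen arbitrarily).
toy = EncodedTextDataset(
    input_ids=[[101, 2023, 102, 0], [101, 2143, 2001, 102]],
    attention_mask=[[1, 1, 1, 0], [1, 1, 1, 1]],
    labels=[0, 1],
)
item = toy[0]
print(item)
```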

In [15]:
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
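
As a quick sanity check on these settings, the number of batches per epoch can be worked out by hand. The helper below is a small sketch (the name `num_batches` is made up); for a map-style dataset with the default sampler it should agree with `len(train_loader)`.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches one pass over the data yields."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_examples // batch_size
    # Otherwise the final batch may be smaller than batch_size.
    return math.ceil(num_examples / batch_size)


print(num_batches(25000, 8))                    # 3125
print(num_batches(25000, 32))                   # 782
print(num_batches(25000, 32, drop_last=True))   # 781
```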


In [16]:
next(iter(train_loader))

{'label': tensor([1, 1, 1, 0, 0, 0, 1, 0]),
 'input_ids': tensor([[  101,  3434, 17266,  ...,  1011,  4526,   102],
         [  101,  2023,  3315,  ...,  1037,  2261,   102],
         [  101,  2044,  2652,  ...,  2298,  2000,   102],
         ...,
         [  101,  2004,  1045,  ...,  3494,  1997,   102],
         [  101,  2023,  2143,  ...,  9364,  1999,   102],
         [  101,  4717, 15247,  ...,  2039,  2108,   102]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]])}

In [17]:
batch = next(iter(train_loader))
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
labels = batch["label"]


In [26]:
print(' '.join(tokenizer.convert_ids_to_tokens(input_ids[0])))

[CLS] spoil ##er alert ! this movie , zero day , gives an inside to the lives of two students , andre and calvin , who feel resentment and hatred for anyone and anything associated with there school . < br / > < br / > they go on a series of self - thought out " missions " all leading up to the huge mission , which is zero day . zero days contents are not specified until the middle to the end of the movie . the viewer knows its serious and filled with hate but is never quite sure until the end . < br / > < br / > now we all know , if the movie is based on the [SEP]
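
In the output above, `##` marks a WordPiece continuation: `spoil ##er` is the single word "spoiler" split into two subword tokens. The sketch below shows one way to glue such pieces back together; the helper `merge_wordpieces` is written for this example only. In practice, `tokenizer.decode(input_ids[0], skip_special_tokens=True)` does a similar job.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue  # drop markers that are not part of the text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without the '##' prefix
        else:
            words.append(token)
    return words


tokens = ["[CLS]", "spoil", "##er", "alert", "!", "[SEP]"]
merged = merge_wordpieces(tokens)
print(merged)  # ['spoiler', 'alert', '!']
```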


In [19]:
model(input_ids=input_ids, attention_mask=attention_mask)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-1.6437e-01, -3.0119e-01,  2.4606e-01,  ..., -4.5059e-02,
           6.1141e-01,  2.2898e-01],
         [ 3.7196e-01, -7.8483e-01,  4.1921e-01,  ...,  4.4211e-01,
           8.3071e-01, -4.8639e-01],
         [-1.4368e-01, -9.0404e-01,  2.6135e-01,  ...,  3.3414e-01,
           6.1729e-01, -2.3585e-01],
         ...,
         [ 1.4112e-01, -3.6525e-01, -6.6771e-03,  ...,  1.4221e-02,
           4.0865e-01,  2.6404e-01],
         [-2.9771e-01, -9.0166e-01, -3.2657e-01,  ...,  4.7729e-03,
           1.8783e-01,  1.1907e-02],
         [-2.3642e-01, -8.4104e-03,  4.1910e-01,  ...,  4.3990e-01,
           3.1462e-01, -6.3929e-01]],

        [[ 1.3358e-01, -1.8097e-03,  3.0410e-01,  ..., -7.3996e-02,
           4.4114e-01,  4.3563e-01],
         [-6.2384e-01,  5.5745e-01,  1.2868e-01,  ..., -1.1573e-01,
           7.2936e-01,  2.7176e-01],
         [-2.4140e-01,  2.7660e-01,  1.0382e+00,  ..., -3.3970e-01,
           3.