**Install and Import the Necessary Libraries**  
Install the following packages if you haven’t already (for example, with `pip install datasets transformers torch`):  
- `datasets` (Hugging Face’s library for datasets)  
- `transformers` (Hugging Face’s library for models, tokenizers, etc.)  
- `torch` (PyTorch)  
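Before importing, it can help to confirm that the three packages are actually present. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these libraries.

```python
from importlib import metadata

# Distribution names taken from the list above.
REQUIRED = ("datasets", "transformers", "torch")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as `not installed` needs to be installed before the import cell will run.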


In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

**Load a Dataset using Hugging Face `datasets`**  
The `datasets` library provides a convenient `load_dataset` function to load many popular NLP datasets.  
- For example, let’s load the [IMDb dataset](https://huggingface.co/datasets/imdb), which is a sentiment classification dataset.


In [4]:
dataset = load_dataset("imdb")  # Returns a dictionary-like DatasetDict with 'train', 'test', and 'unsupervised' splits
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


**Choose and Load a Model & Tokenizer**  
With Hugging Face Transformers, you typically load a pretrained model together with its matching tokenizer, so that the text is split into the same vocabulary the model was trained on. For example, we can load a pretrained [BERT-base-uncased](https://huggingface.co/bert-base-uncased):


In [5]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

**Preprocess the Dataset with the Tokenizer**  
You must tokenize the raw texts so that they become valid model inputs.  
- We typically apply the tokenizer to each text sample.  
- The tokenizer returns a dictionary with `input_ids` and `attention_mask`; for BERT it also returns `token_type_ids`.  
- The labels are kept as well, because `map` adds the new columns next to the existing `text` and `label` columns.
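To make the roles of `input_ids` and `attention_mask` concrete, here is a toy sketch of what fixed-length padding and truncation produce. It is not the real tokenizer: the function `encode_fixed_length` is hypothetical, it works on already-numeric ids, and it omits `token_type_ids`. The special-token ids are those of `bert-base-uncased` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102).

```python
# Special-token ids of bert-base-uncased.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def encode_fixed_length(content_ids, max_length):
    """Toy stand-in for padding="max_length" with truncation=True."""
    # Leave room for [CLS] and [SEP], then cut off anything that does not fit.
    content = list(content_ids)[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


short = encode_fixed_length([2023, 2003], max_length=6)
long = encode_fixed_length(range(1000, 1010), max_length=6)
print(short)  # padded: two trailing zeros in both lists
print(long)   # truncated: only the first four content ids survive
```

Both results have exactly `max_length` entries, which is what allows examples of different lengths to be stacked into one batch tensor.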

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
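def tokenize_function(examples):
    # Pad or truncate every review to exactly 128 tokens, including [CLS] and [SEP].
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
        add_special_tokens=True,
    )

# batched=True passes lists of texts to the tokenizer instead of one example at a time.
tokenized_dataset = dataset.map(tokenize_function, batched=True)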


**Convert Dataset Splits to PyTorch-friendly Format**  
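A Hugging Face `Dataset` can return PyTorch tensors for selected columns once you call `set_format`; alternatively, you can write a custom `Dataset` subclass. A common approach is to inspect the tokenized columns, call `set_format`, and then wrap each split in a `DataLoader`: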

In [8]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [9]:
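# Return only these columns, as torch tensors; 'text' and 'token_type_ids' are hidden, not deleted.
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])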

In [10]:
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
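The text above mentions a custom `Dataset` subclass as an alternative to `set_format`. A minimal sketch of that route follows; the class name `EncodedTextDataset` and the tiny hand-written encodings are hypothetical stand-ins, not the IMDb data.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class EncodedTextDataset(Dataset):
    """Hypothetical wrapper around already-tokenized examples stored as Python lists."""

    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Convert one example to tensors; the default collate function stacks them into a batch.
        return {
            "input_ids": torch.tensor(self.input_ids[idx], dtype=torch.long),
            "attention_mask": torch.tensor(self.attention_mask[idx], dtype=torch.long),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }


# Synthetic, already padded examples of length 4 (illustrative values only).
toy_dataset = EncodedTextDataset(
    input_ids=[[101, 2023, 102, 0], [101, 2008, 2003, 102], [101, 102, 0, 0]],
    attention_mask=[[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]],
    labels=[1, 0, 1],
)
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)
toy_batch = next(iter(toy_loader))
print(toy_batch["input_ids"].shape)  # torch.Size([2, 4])
```

For data that already lives in a Hugging Face `Dataset`, `set_format` is usually less code; a subclass like this is mainly useful when the examples come from somewhere else or need custom per-item logic.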


In [11]:
next(iter(train_loader))

{'label': tensor([1, 0, 1, 1, 1, 0, 0, 0]),
 'input_ids': tensor([[  101,  2040,  2081,  ..., 10334,  1005,   102],
         [  101,  1045,  2428,  ...,  2007,  1996,   102],
         [  101,  2025,  2069,  ...,  1997,  1037,   102],
         ...,
         [  101,  1999,  1037,  ...,  1996, 11967,   102],
         [  101,  2008,  1005,  ...,  2003,  2204,   102],
         [  101,  2079,  2025,  ...,  2009,  2172,   102]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]])}

In [12]:
batch = next(iter(train_loader))
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
labels = batch["label"]


In [13]:
print(' '.join(tokenizer.convert_ids_to_tokens(input_ids[0])))

[CLS] john boo ##rman ' s 1998 the general was hailed as a major comeback , though it ' s hard to see why on the evidence of the film itself . one of three films made that year about famed northern irish criminal martin ca ##hill ( alongside ordinary decent criminal and vicious circles ) , it has an abundance of incident and style ( the film was shot in colour but released in b & w scope in some territories ) but makes absolutely no impact and just goes on forever . with a main character who threatens witnesses , car bombs doctors , causes a hundred people to lose their jobs , tries to buy off the sexually abused daughter of one of his [SEP]


In [14]:
model(input_ids=input_ids, attention_mask=attention_mask)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-2.8016e-01, -9.4245e-03, -1.4533e-01,  ..., -3.6636e-01,
           5.4831e-01,  3.8091e-01],
         [ 2.0655e-01, -8.2792e-03,  1.4541e-01,  ...,  9.8328e-02,
          -2.3967e-01, -6.6597e-01],
         [-1.9836e-01, -2.0848e-01,  1.9919e-01,  ...,  2.6854e-01,
           6.3531e-02,  3.0949e-01],
         ...,
         [-5.5118e-01, -5.0483e-01,  2.4049e-01,  ...,  7.3778e-02,
          -4.2784e-01,  1.5624e-01],
         [ 1.5739e-01,  2.1491e-01,  9.3533e-01,  ...,  3.7672e-01,
           3.5113e-01, -2.8569e-01],
         [ 3.4331e-01,  5.3125e-01, -1.0440e-01,  ...,  9.4862e-02,
          -7.2419e-01, -3.1751e-01]],

        [[-4.0850e-02, -1.0721e-01,  6.6465e-01,  ..., -1.8922e-01,
           2.3761e-01,  4.4487e-01],
         [-4.6778e-04,  2.0396e-01, -7.4568e-01,  ...,  3.4672e-01,
           1.0196e+00,  2.9987e-01],
         [ 1.5680e-01, -4.6012e-01,  4.9192e-01,  ...,  8.5190e-01,
           7.

In [4]:
# Random integers in [2, 10) with shape (3, 1, 4, 4), e.g. three single-channel 4x4 inputs.
a = torch.randint(2, 10, (3, 1, 4, 4))

In [7]:
# Flatten everything except the first dimension: (3, 1, 4, 4) -> (3, 16).
a.view(3, -1)

tensor([[4, 8, 3, 6, 2, 5, 5, 4, 9, 4, 7, 5, 8, 4, 6, 8],
        [3, 5, 2, 9, 5, 3, 7, 9, 6, 7, 8, 6, 8, 3, 5, 9],
        [7, 5, 8, 9, 7, 9, 8, 3, 7, 9, 4, 2, 7, 3, 4, 4]])