# **Dataset and DataLoader Classes**



The `Dataset` and `DataLoader` classes are **core abstractions in PyTorch** that help separate how data is defined from how it is efficiently loaded and iterated during training.

### **Dataset Class**

The Dataset class is essentially a blueprint. When you create a custom Dataset, you decide how data is loaded and returned.

It defines:

* `__init__()`, which sets up the data source (for example, storing tensors or file paths).
* `__len__()`, which returns the total number of samples.
* `__getitem__(index)`, which returns the sample (and its label) at the given index.
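
As a rough illustration of this blueprint, the sketch below adds an optional per-sample transform. The class name `TransformedDataset` and the `transform` argument are hypothetical names chosen for this example, not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):
    """Map-style dataset that applies an optional transform per sample."""

    def __init__(self, features, labels, transform=None):
        # Only store references here; no per-sample work happens yet.
        self.features = features
        self.labels = labels
        self.transform = transform  # hypothetical hook: any callable

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)  # applied lazily, one sample at a time
        return x, self.labels[index]


features = torch.arange(6, dtype=torch.float32).reshape(3, 2)
labels = torch.tensor([0, 1, 0])
doubled = TransformedDataset(features, labels, transform=lambda x: x * 2)
sample_x, sample_y = doubled[1]
```

Because the transform runs inside `__getitem__`, it is only applied to the samples that are actually requested.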

### **DataLoader Class**

The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

**DataLoader Control Flow:**

* At the start of each epoch, the sampler produces the index order: shuffled if `shuffle=True`, sequential otherwise.
* The indices are divided into chunks of `batch_size`.
* For each index in a chunk, the sample is fetched from the Dataset object via `__getitem__`.
* The samples are then collected and combined into a batch (using `collate_fn`).
* The batch is then returned to the main training loop.
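
The steps above can be reproduced by hand. This is only a sketch of the single-process case, with `TensorDataset` as a stand-in dataset (an assumption for this example); it is not the actual `DataLoader` implementation.

```python
import torch
from torch.utils.data import BatchSampler, RandomSampler, TensorDataset
from torch.utils.data.dataloader import default_collate

# Stand-in dataset: 10 samples, one feature each, label equal to the index.
dataset = TensorDataset(
    torch.arange(10, dtype=torch.float32).unsqueeze(1),
    torch.arange(10),
)

sampler = RandomSampler(dataset)                                      # shuffled indices
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)  # chunks of 4

manual_batches = []
for batch_indices in batch_sampler:
    samples = [dataset[i] for i in batch_indices]  # one __getitem__ call per index
    features, labels = default_collate(samples)    # stack samples into batch tensors
    manual_batches.append((features, labels))
```

With 10 samples and `batch_size=4`, this produces two batches of 4 and a final batch of 2.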


In [2]:
from sklearn.datasets import make_classification
import torch

In [3]:
X, y = make_classification(
    n_samples=10,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=2,
    random_state=42
)

In [4]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [5]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [6]:
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

In [7]:
from torch.utils.data import Dataset, DataLoader

In [20]:
class CustomDataset(Dataset):

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, index):
        # Any transformations can be applied here, e.g., converting to grayscale, data augmentation, etc.
        return self.features[index], self.labels[index]

In [21]:
dataset = CustomDataset(X, y)

In [22]:
len(dataset)

10

In [23]:
dataset[3]

(tensor([-0.7206, -0.9606]), tensor(0))

In [24]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [25]:
for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("-"*30)

tensor([[-1.1402, -0.8388],
        [-0.7206, -0.9606]])
tensor([0, 0])
------------------------------
tensor([[ 1.7273, -1.1858],
        [-0.5872, -1.9717]])
tensor([1, 0])
------------------------------
tensor([[-2.8954,  1.9769],
        [ 1.8997,  0.8344]])
tensor([0, 1])
------------------------------
tensor([[ 1.7774,  1.5116],
        [-1.9629, -0.9923]])
tensor([1, 0])
------------------------------
tensor([[ 1.0683, -0.9701],
        [-0.9382, -0.5430]])
tensor([1, 1])
------------------------------


## **Parallel Data Loading and Training with `num_workers=4` in PyTorch**

#### **Assumptions**

* **Total samples**: 10,000
* **Batch size**: 32
* **Workers (`num_workers`)**: 4
* **Total batches per epoch**: 313 (10,000 / 32 = 312.5, so 312 full batches plus a final batch of 16 samples; exactly 312 if `drop_last=True`)
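
The batch count under these assumptions can be checked directly. The sketch below uses a dummy `TensorDataset` of zeros (an assumption for this example) and compares the default behaviour with `drop_last=True`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples, batch_size = 10_000, 32
dataset = TensorDataset(torch.zeros(num_samples, 1))  # dummy data, only the length matters

# 312 full batches, with 16 samples left over for a final partial batch
full_batches, remainder = divmod(num_samples, batch_size)

keep_last = DataLoader(dataset, batch_size=batch_size)                  # drop_last=False by default
drop_last = DataLoader(dataset, batch_size=batch_size, drop_last=True)  # discards the partial batch

print(len(keep_last), len(drop_last))  # 313 312
```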

### **Workflow**

#### **1. Sampler and Batch Creation (Main Process)**

At the start of each epoch:

* The `DataLoader`'s sampler **shuffles** all 10,000 indices (assuming `shuffle=True`).
* A batch sampler groups them into **313 batches**: 312 batches of 32 indices plus a final batch of 16.
* The main process hands these index batches to the workers **a few at a time**; it does not load any data itself.

#### **2. Parallel Data Loading (Workers)**

At the start of the training epoch, you typically run:

```python
for batch_data, batch_labels in dataloader:
    # Training logic
```

Under the hood, as soon as iteration starts:

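* The main process dispatches **batches of indices** to the 4 workers in round-robin order, so the first four go out as:

  * **Worker #1** loads batch 1
  * **Worker #2** loads batch 2
  * **Worker #3** loads batch 3
  * **Worker #4** loads batch 4

* With the default `prefetch_factor=2`, each worker is handed two index batches up front, so batches 5 to 8 are already queued behind these four.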

Each worker:

* Calls `__getitem__()` on the dataset for each index in the batch
* Applies any defined **transforms**
* Passes samples through `collate_fn` to form a **single batch tensor**
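
To see which process actually runs `__getitem__`, a dataset can tag every sample with the id reported by `torch.utils.data.get_worker_info()`. The class `WorkerTaggedDataset` and the helper `worker_ids_per_batch` are hypothetical names for this sketch, which only runs the single-process case.

```python
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class WorkerTaggedDataset(Dataset):
    """Returns (index, worker_id); worker_id is -1 when loaded in the main process."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        info = get_worker_info()  # None when called from the main process
        worker_id = -1 if info is None else info.id
        return index, worker_id


def worker_ids_per_batch(num_workers, size=16, batch_size=4):
    """Collect the distinct worker ids that contributed to each batch."""
    loader = DataLoader(
        WorkerTaggedDataset(size), batch_size=batch_size, num_workers=num_workers
    )
    return [sorted(set(worker_ids.tolist())) for _, worker_ids in loader]


in_process = worker_ids_per_batch(num_workers=0)  # every batch is tagged -1
```

Calling `worker_ids_per_batch(num_workers=4)` should instead show a single worker id per batch, since each batch is assembled by one worker. On platforms that start workers with `spawn`, that call belongs under an `if __name__ == "__main__":` guard.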


#### **3. First Batch Returned to Main Process**

* Each worker places its finished batch on a shared queue. The main process hands batches over in their original order, so batch 1 is delivered first even if another worker finishes sooner.
* The main process **yields** this batch to your training loop:

  ```python
  for batch_data, batch_labels in dataloader:
      ...
  ```


#### **4. Model Training on the Main Process**

While the first batch is being used for:

* **Forward pass**
* **Loss computation**
* **Backpropagation**

...the workers continue preparing later batches **in parallel**.
If data loading keeps pace with training, the **next batch is already waiting** by the time you are done with batch 1.


#### **5. Continuous Processing**

* As soon as a worker finishes one batch, it grabs the **next one**:

  * After Worker #1 finishes batch 1 → it starts batch 5
  * After Worker #2 finishes batch 2 → it starts batch 6
  * And so on...

This creates a **pipeline effect**:

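* At any moment, **up to 4 batches** are being loaded concurrently, and up to `num_workers * prefetch_factor` batches (8 with the defaults here) are either in progress or waiting to be consumed.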


#### **6. Loop Progression**

Your training loop keeps running smoothly:

```python
for batch_data, batch_labels in dataloader:
    # forward pass
    # loss computation
    # backward pass
    # optimizer step
```

* Each iteration receives a **ready-to-use batch**
* I/O waits are hidden behind computation thanks to **background data loading**, as long as the workers keep up with the training step


#### **7. End of the Epoch**

* After 313 iterations (312 with `drop_last=True`), all batches are processed.
* All indices are consumed, so the `DataLoader` stops yielding data.

On the **next epoch**:

* If `shuffle=True`, the sampler reshuffles indices.
* The whole process **repeats** with workers loading data in parallel again.
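
The per-epoch reshuffle can be observed by iterating the same `DataLoader` twice. In this sketch the `generator` argument is used only to make the shuffles repeatable; the helper `epoch_orders` and the tiny `TensorDataset` are assumptions for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))  # each sample is just its own index


def epoch_orders(seed, epochs=2):
    """Return the sample order seen in each epoch for a seeded, shuffling loader."""
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)
    orders = []
    for _ in range(epochs):
        # Each pass over the loader is one epoch and draws a fresh permutation.
        orders.append([int(i) for (batch,) in loader for i in batch])
    return orders


run_a = epoch_orders(seed=0)
run_b = epoch_orders(seed=0)  # same seed, so the same sequence of shuffles
```

Every epoch visits each index exactly once, and two runs with the same seed see the same sequence of orderings.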

---


### **PyTorch Samplers**

In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from the dataset during data loading.
It controls how indices of the dataset are drawn for each batch.


### **Types of Samplers**

PyTorch provides several predefined samplers, and you can create custom ones:

1. **SequentialSampler**:
    - Samples elements sequentially, in the order they appear in the dataset.
    - *Default when* `shuffle=False`.

2. **RandomSampler**:
    - Samples elements randomly without replacement.
    - *Default when* `shuffle=True`.

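3. **Custom samplers**:
    - There is no built-in `CustomSampler` class. Instead, you subclass `torch.utils.data.Sampler` and implement `__iter__` (and usually `__len__`) to yield indices in whatever order you need.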
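
A minimal sketch of a user-defined sampler follows. `ReverseSampler` is a hypothetical class written for this example; it simply yields indices from last to first.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields dataset indices in reverse order."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


dataset = TensorDataset(torch.arange(6))
# A custom sampler replaces shuffle; the two options should not be combined.
loader = DataLoader(dataset, batch_size=4, sampler=ReverseSampler(dataset))
batches = [batch.tolist() for (batch,) in loader]
print(batches)  # [[5, 4, 3, 2], [1, 0]]
```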

---



## **`collate_fn` in PyTorch**

The `collate_fn` in PyTorch's `DataLoader` is a function that specifies how to combine a list of samples from a dataset into a single batch.

By default, the `DataLoader` stacks the samples' tensors along a new first dimension, which requires every sample to have the same shape.
A custom `collate_fn` lets you **control how the samples are processed and combined into a batch**.

A common example is padding: when samples are variable-length sequences, a custom `collate_fn` pads them to a common length before they are stacked.
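
A sketch of such a padding `collate_fn`, assuming each sample is a `(sequence, label)` pair and that 0 is an acceptable padding value. The function name `pad_collate` and the toy samples are made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Pad variable-length sequences to the longest one in the batch."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# Toy samples of different lengths; a plain list works as a map-style dataset.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4]), 1),
    (torch.tensor([5, 6]), 0),
]
loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
padded, lengths, labels = next(iter(loader))
```

Returning the original lengths alongside the padded tensor lets downstream code mask out the padding.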

---


## **DataLoader Important Parameters**

The DataLoader class in PyTorch comes with several parameters that allow you to customize
how data is loaded, batched, and preprocessed. Some of the most commonly used and
important parameters include:

**1. dataset (mandatory):**

* The Dataset from which the DataLoader will pull data.
* Typically a subclass of `torch.utils.data.Dataset` that implements `__getitem__` and `__len__` (a map-style dataset).

**2. batch\_size:**

* How many samples per batch to load.
* Default is 1.
* Larger batch sizes can speed up training on GPUs but require more memory.

**3. shuffle:**

* If True, the DataLoader will shuffle the dataset indices each epoch.
* Helpful to avoid the model becoming too dependent on the order of samples.

**4. num\_workers:**

* The number of worker processes used to load data in parallel.
* Setting `num_workers > 0` can speed up data loading by leveraging multiple CPU
  cores, especially if I/O or preprocessing is a bottleneck.

**5. pin\_memory:**

* If True, the DataLoader will copy tensors into pinned (page-locked) memory before
  returning them.
* This can improve GPU transfer speed and thus overall training throughput,
  particularly on CUDA systems.

**6. drop\_last:**

* If True, the DataLoader will drop the last incomplete batch if the total number of
  samples is not divisible by the batch size.
* Useful when exact batch sizes are required (for example, in some batch
  normalization scenarios).

**7. collate\_fn:**

* A callable that processes a list of samples into a batch (the default simply stacks
  tensors).
* Custom collate\_fn can handle variable-length sequences, perform custom batching
  logic, or handle complex data structures.

**8. sampler / batch\_sampler:**

* `sampler` defines the strategy for drawing samples (e.g., for handling imbalanced
  classes, or custom sampling strategies).
* `batch_sampler` works at the batch level, controlling how batches are formed.
* Typically, you don't need to specify these if you are using `batch_size` and `shuffle`.
  However, they provide lower-level control if you have advanced requirements.
* `sampler` cannot be combined with `shuffle=True`, and `batch_sampler` cannot be combined
  with `batch_size`, `shuffle`, `sampler`, or `drop_last`.
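
For the imbalanced-class case mentioned above, one option is `WeightedRandomSampler`. The sketch below uses a synthetic 90/10 label split and inverse-frequency weights; both are assumptions for illustration, not a recommendation for any particular dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

labels = torch.tensor([0] * 90 + [1] * 10)  # synthetic 90/10 class imbalance
features = torch.randn(len(labels), 2)
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels)                # tensor([90, 10])
sample_weights = 1.0 / class_counts[labels].float()  # rarer class gets a larger weight

# Draw with replacement so minority samples can appear more than once per epoch.
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)  # no shuffle=True here

drawn = torch.cat([batch_labels for _, batch_labels in loader])
```

With these weights each class is drawn with roughly equal probability, so `drawn` should contain far more than the original 10% of class 1.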

---

## **Improving the Code with Mini Batch Gradient Descent**

In [26]:
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [27]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')

df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

df.head()

,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [31]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('diagnosis', axis=1), df['diagnosis'], test_size=0.2, random_state=42)

In [34]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

X_train_tensor = torch.from_numpy(X_train)
X_test_tensor = torch.from_numpy(X_test)
y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)

X_train_tensor = X_train_tensor.float()
X_test_tensor = X_test_tensor.float()

In [35]:
class CustomDataset(Dataset):

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

In [36]:
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

In [37]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [49]:
# create model class
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_features):

        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(num_features, 5),
            nn.ReLU(),
            nn.Linear(5, 1),
            nn.Sigmoid()
        )


    def forward(self, x):
        out = self.network(x)

        return out

In [50]:
# Important Parameters
learning_rate = 0.1
epochs = 25

In [51]:
model = Model(X_train_tensor.shape[1])
loss_function = nn.BCELoss()   # Binary Cross Entropy
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # SGD: Stochastic Gradient Descent

In [53]:
for epoch in range(epochs):
    for batch_features, batch_labels in train_loader:
        # Forward Pass
        y_pred = model(batch_features)   # automatically calls the forward method

        # Calculate Loss
        loss = loss_function(y_pred.flatten(), batch_labels.float())

        # Clear Gradients
        optimizer.zero_grad()

        # back prop
        loss.backward()

        # Update Parameters
        optimizer.step()

    # Note: this is the loss of the last batch in the epoch, not the epoch average
    print(f"Epoch: {epoch+1}, Loss : {loss.item():.4f}")

Epoch: 1, Loss : 0.0516
Epoch: 2, Loss : 0.3328
Epoch: 3, Loss : 0.0105
Epoch: 4, Loss : 0.0221
Epoch: 5, Loss : 0.0027
Epoch: 6, Loss : 0.0068
Epoch: 7, Loss : 0.0158
Epoch: 8, Loss : 0.0019
Epoch: 9, Loss : 0.0535
Epoch: 10, Loss : 0.0075
Epoch: 11, Loss : 0.0067
Epoch: 12, Loss : 0.0513
Epoch: 13, Loss : 0.0197
Epoch: 14, Loss : 0.0080
Epoch: 15, Loss : 0.0530
Epoch: 16, Loss : 0.0231
Epoch: 17, Loss : 0.0023
Epoch: 18, Loss : 0.0084
Epoch: 19, Loss : 0.0430
Epoch: 20, Loss : 0.0016
Epoch: 21, Loss : 0.0004
Epoch: 22, Loss : 0.0172
Epoch: 23, Loss : 0.0145
Epoch: 24, Loss : 0.0023
Epoch: 25, Loss : 0.0212
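
The values above jump around because each one is the loss of whichever batch happened to come last in that epoch. A smoother signal is the sample-weighted mean loss over the whole epoch. The sketch below shows one way to compute it; the helper `train_one_epoch` and the synthetic data are assumptions for this example, not the notebook's breast cancer data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, loss_function, optimizer):
    """Run one epoch and return the sample-weighted mean loss."""
    model.train()
    total_loss, total_samples = 0.0, 0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        y_pred = model(batch_features).flatten()
        loss = loss_function(y_pred, batch_labels.float())
        loss.backward()
        optimizer.step()
        # loss.item() is a batch mean, so weight it by the batch size
        total_loss += loss.item() * batch_labels.size(0)
        total_samples += batch_labels.size(0)
    return total_loss / total_samples


# Synthetic stand-in data: 70 samples, 4 features, label depends on the first feature
torch.manual_seed(0)
X = torch.randn(70, 4)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epoch_losses = [train_one_epoch(model, loader, nn.BCELoss(), optimizer) for _ in range(5)]
```

Weighting by batch size matters here because 70 samples with `batch_size=32` leave a final batch of only 6.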


In [54]:
model.eval()
accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        y_pred = model(batch_features)
        y_pred = (y_pred > 0.5).float()
        accuracy = (y_pred.flatten() == batch_labels).float().mean()
        accuracy_list.append(accuracy.item())

# Unweighted mean of per-batch accuracies: a smaller final batch counts as much as a full one
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f"Overall Accuracy: {overall_accuracy:.4f}")

Overall Accuracy: 0.9922
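
Averaging per-batch accuracies gives every batch the same weight, so if the test set size is not a multiple of the batch size, the smaller final batch is over-represented. A sketch of an alternative that counts correct predictions over all samples follows. The helper `evaluate_accuracy` and the fixed `FirstFeatureModel` are hypothetical and exist only to make the difference visible.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader, threshold=0.5):
    """Return correct predictions divided by total samples."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_features, batch_labels in loader:
            probs = model(batch_features).flatten()
            preds = (probs > threshold).long()
            correct += (preds == batch_labels).sum().item()
            total += batch_labels.size(0)
    return correct / total


class FirstFeatureModel(nn.Module):
    """Hypothetical fixed model: sigmoid of the first feature."""

    def forward(self, x):
        return torch.sigmoid(x[:, :1])


X = torch.tensor([[2.0], [-2.0], [3.0], [-1.0], [1.0]])
y = torch.tensor([1, 0, 1, 0, 0])  # only the last sample is misclassified
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

accuracy = evaluate_accuracy(FirstFeatureModel(), loader)
print(f"{accuracy:.2f}")  # 0.80
```

Here the batches have sizes 4 and 1 with accuracies 1.0 and 0.0, so the mean of per-batch accuracies would report 0.5 even though 4 of 5 samples are classified correctly.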
