# PyTorch Dataset and DataLoader

In this tutorial, we will learn how to:

- Create a custom dataset using `torch.utils.data.Dataset`
- Use `DataLoader` to batch and shuffle data
- Apply simple transformations (if needed)

Together, these two classes separate how samples are stored from how they are batched, shuffled, and fed to a training loop in PyTorch.


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

## Creating a Dummy Dataset

Let's simulate a dataset of student grades. Each student has:

- A feature vector: `[hours_studied, hours_slept]`
- A target: `score`

We'll create a custom Dataset class to manage this data.


In [2]:
# Sample data: [hours_studied, hours_slept], target: score out of 100
students_data = [
    [2, 9], [4, 8], [6, 7], [8, 6], [10, 5],
    [1, 9], [3, 8], [5, 7], [7, 6], [9, 5]
]
scores = [50, 60, 70, 80, 90, 45, 55, 65, 75, 85]


## Defining a Custom Dataset

We inherit from `torch.utils.data.Dataset` and override:

- `__init__`: to load and initialize data
- `__len__`: to return the total number of samples
- `__getitem__`: to fetch a data sample given an index


In [None]:
class StudentDataset(Dataset):
    def __init__(self, features, targets):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)
    
    def __len__(self):
        # Return total number of samples
        return len(self.features)
    
    def __getitem__(self, index):
        # Return one sample of data and label
        return self.features[index], self.targets[index]


## Initialize Dataset

We now create an instance of the `StudentDataset` class.


In [11]:
dataset = StudentDataset(students_data, scores)

# Let's inspect one sample
print(dataset[0])


(tensor([2., 9.]), tensor(50.))
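
The introduction mentions simple transformations. One common pattern, sketched here under our own assumptions, is to pass an optional callable to the dataset and apply it inside `__getitem__`. The names `TransformedStudentDataset` and `scale_hours` are hypothetical and not part of PyTorch, and dividing by 24 is just an illustrative way to bring the hour values into a small range.

```python
import torch
from torch.utils.data import Dataset


class TransformedStudentDataset(Dataset):
    """Hypothetical variant of StudentDataset with an optional feature transform."""

    def __init__(self, features, targets, transform=None):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)
        self.transform = transform  # any callable taking and returning a tensor

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        x = self.features[index]
        # Apply the transform lazily, one sample at a time
        if self.transform is not None:
            x = self.transform(x)
        return x, self.targets[index]


def scale_hours(x):
    # Assumption for illustration: express hours as a fraction of a 24-hour day
    return x / 24.0


# A small slice of the tutorial data keeps the example short
sample_data = [[2, 9], [4, 8], [6, 7]]
sample_scores = [50, 60, 70]

scaled_dataset = TransformedStudentDataset(sample_data, sample_scores, transform=scale_hours)
sample_features, sample_target = scaled_dataset[0]
print(sample_features, sample_target)
```

Because the transform runs in `__getitem__`, the stored tensors stay unchanged and the same data can be reused with a different transform, or with none at all.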


## Using DataLoader

The `DataLoader` provides:

- Automatic batching
- Shuffling of data
- Parallel data loading in worker processes (only when `num_workers` is greater than 0; the default of 0 loads data in the main process)

We use it to efficiently iterate over the dataset during training.
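
To make "automatic batching" concrete, the sketch below counts batches on a stand-in dataset of 10 samples. It uses `TensorDataset` with made-up values only so that the block runs on its own; the variable names are ours. With a batch size of 3, the loader yields `ceil(10 / 3) = 4` batches and the last one holds a single sample, unless `drop_last=True` discards it.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 samples with 2 features each (values are arbitrary)
toy_features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
toy_targets = torch.arange(10, dtype=torch.float32)
toy_dataset = TensorDataset(toy_features, toy_targets)

keep_loader = DataLoader(toy_dataset, batch_size=3)                  # keeps the short final batch
drop_loader = DataLoader(toy_dataset, batch_size=3, drop_last=True)  # discards it

keep_sizes = [len(x) for x, _ in keep_loader]
drop_sizes = [len(x) for x, _ in drop_loader]

print(len(keep_loader), keep_sizes)  # 4 [3, 3, 3, 1]
print(len(drop_loader), drop_sizes)  # 3 [3, 3, 3]
```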


In [12]:
# Create a DataLoader
data_loader = DataLoader(dataset, batch_size=3, shuffle=True)

# Iterate through batches; with shuffle=True the order changes on every run
for features, targets in data_loader:
    print("Features:\n", features)
    print("Targets:\n", targets)
    print("-" * 30)


Features:
 tensor([[4., 8.],
        [8., 6.],
        [3., 8.]])
Targets:
 tensor([60., 80., 55.])
------------------------------
Features:
 tensor([[ 1.,  9.],
        [ 2.,  9.],
        [10.,  5.]])
Targets:
 tensor([45., 50., 90.])
------------------------------
Features:
 tensor([[7., 6.],
        [6., 7.],
        [9., 5.]])
Targets:
 tensor([75., 70., 85.])
------------------------------
Features:
 tensor([[5., 7.]])
Targets:
 tensor([65.])
------------------------------
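
As a rough sketch of how such a loader might be used during training, the block below fits a single linear layer to the student data. The model, loss, optimizer, learning rate, and epoch count are our own illustrative choices, not part of the tutorial above. `TensorDataset` stands in for `StudentDataset` so the block is self-contained; both return `(features, target)` pairs. Passing a seeded `torch.Generator` makes the shuffle order repeatable between runs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixes the model's initial weights

features = torch.tensor(
    [[2, 9], [4, 8], [6, 7], [8, 6], [10, 5],
     [1, 9], [3, 8], [5, 7], [7, 6], [9, 5]],
    dtype=torch.float32,
)
targets = torch.tensor([50, 60, 70, 80, 90, 45, 55, 65, 75, 85], dtype=torch.float32)

train_dataset = TensorDataset(features, targets)
train_loader = DataLoader(
    train_dataset,
    batch_size=3,
    shuffle=True,
    generator=torch.Generator().manual_seed(0),  # repeatable shuffle order
)

# Assumed setup for illustration: linear regression trained with plain SGD
model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)


def full_dataset_loss():
    # Mean squared error over all 10 samples, without tracking gradients
    with torch.no_grad():
        return loss_fn(model(features).squeeze(1), targets).item()


initial_loss = full_dataset_loss()

for epoch in range(20):
    for batch_features, batch_targets in train_loader:
        optimizer.zero_grad()
        # squeeze(1) turns shape (batch, 1) into (batch,) to match the targets
        predictions = model(batch_features).squeeze(1)
        loss = loss_fn(predictions, batch_targets)
        loss.backward()
        optimizer.step()

final_loss = full_dataset_loss()
print(f"loss before: {initial_loss:.1f}, after: {final_loss:.1f}")
```

The loop never indexes the dataset directly: the `DataLoader` decides which samples form each batch, so changing the batch size or turning shuffling off needs no change to the training code.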
