<a href="https://colab.research.google.com/github/sufiyansayyed19/myTorch/blob/main/11_Dataset_%26_DataLoader_Utilities_(Minimal%2C_Conceptual).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Goal

Provide a minimal, conceptual, and example-driven reference for PyTorch Dataset and DataLoader utilities so batching, shuffling, and data iteration are clear.

## Prerequisites

* Basic understanding of tensors.
* Awareness that training uses batches of data.

## After This Notebook You Can

* Explain why Dataset and DataLoader exist.
* Create a minimal custom Dataset.
* Use DataLoader with batch_size and shuffle correctly.
* Explain data loading concepts in interviews.

## Out of Scope

* Data augmentation.
* Distributed data loading.
* Advanced performance tuning.

---

## METHODS / CONCEPTS COVERED (SUMMARY)

Data abstractions:

* torch.utils.data.Dataset
* torch.utils.data.DataLoader

Key ideas:

* batching
* shuffling
* iteration

---

## Dataset (Concept)

What it is:
An object that knows how to return one data sample at a time.

Why it exists:
To decouple data access from training logic.

Two required methods:

* `__len__`: returns the number of samples
* `__getitem__`: returns the sample at a given index

Minimal example:

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self):
        self.x = torch.tensor([1.0, 2.0, 3.0, 4.0])
        self.y = torch.tensor([2.0, 4.0, 6.0, 8.0])

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = SimpleDataset()
dataset[0]  # one (x, y) sample: (tensor(1.), tensor(2.))
```

Common mistake:
Putting batching logic inside Dataset.
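
To make the "one sample at a time" contract concrete, here is a minimal sketch. The class name `PairDataset` and its size argument `n` are hypothetical, chosen only for illustration. It shows that the Dataset answers exactly two questions (how many samples, and what is sample `idx`) and knows nothing about batches.

```python
import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):  # hypothetical name, for illustration only
    def __init__(self, n):
        # n samples: x = 0, 1, ..., n-1 and y = 2 * x
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = 2 * self.x

    def __len__(self):
        # Number of samples available
        return len(self.x)

    def __getitem__(self, idx):
        # Exactly one (x, y) sample; no batch dimension is added here
        return self.x[idx], self.y[idx]

pairs = PairDataset(4)
n_samples = len(pairs)                  # uses __len__
first_x, first_y = pairs[0]             # uses __getitem__
last_x, last_y = pairs[n_samples - 1]

print(n_samples, first_x.shape, last_y)
```

Each returned element is a 0-dimensional tensor. Grouping samples into batches is left to whatever consumes the Dataset.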

---

## DataLoader

What it does:
Wraps a Dataset and returns data in batches.

Why it exists:
To handle batching, shuffling, and iteration automatically.

Minimal example:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_x, batch_y in loader:
    print(batch_x, batch_y)
```

Important parameters:

* batch_size
* shuffle

Common mistake:
Manually batching data instead of using DataLoader.
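
The effect of `batch_size` on the number and size of batches can be sketched as follows. This example uses `TensorDataset` as a stand-in dataset and also shows `drop_last`, a DataLoader argument not covered above, purely to illustrate what happens to an incomplete final batch. The toy tensors are assumptions made for the demo.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (assumed for illustration): 10 samples, 1 feature each
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
targets = 2 * features.squeeze(1)                              # shape (10,)
toy = TensorDataset(features, targets)

# 10 samples with batch_size=4 gives batches of 4, 4, and 2
loader = DataLoader(toy, batch_size=4, shuffle=False)
batch_sizes = [xb.shape[0] for xb, yb in loader]
print(batch_sizes, len(loader), math.ceil(10 / 4))

# drop_last=True discards the incomplete final batch
drop_loader = DataLoader(toy, batch_size=4, shuffle=False, drop_last=True)
drop_sizes = [xb.shape[0] for xb, yb in drop_loader]
print(drop_sizes, len(drop_loader))
```

`len(loader)` counts batches, not samples, so it changes with `batch_size` while `len(toy)` stays at 10.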

---

## Batching (Concept)

Why batching exists:

* Improves computational efficiency
* Stabilizes gradient estimates (averaging over several samples is less noisy than using one)
* Enables vectorized operations

Key idea:

One batch = many samples processed together.
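
A small sketch of what "processed together" means in practice. It assumes a toy `TensorDataset` and an arbitrary weight vector, neither of which comes from this notebook. With default settings, DataLoader stacks individual samples along a new leading dimension, and one vectorized operation over the batch gives the same result as looping over samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Four samples with two features each (toy values, assumed for illustration)
samples = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
data = TensorDataset(samples)

loader = DataLoader(data, batch_size=2, shuffle=False)
(first_batch,) = next(iter(loader))  # TensorDataset yields 1-element tuples

# The batch is the first two samples stacked along dimension 0
manual_stack = torch.stack([data[0][0], data[1][0]])
print(first_batch.shape, torch.equal(first_batch, manual_stack))

# One vectorized operation over the whole batch ...
weights = torch.tensor([0.5, -1.0])  # arbitrary weights for the demo
batched_out = first_batch @ weights

# ... matches a per-sample Python loop
looped_out = torch.stack([sample @ weights for sample in first_batch])
print(batched_out, torch.equal(batched_out, looped_out))
```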

---

## Shuffling (Concept)

Why shuffling exists:

* Prevents learning order-specific patterns
* Improves generalization

Key rule:

* shuffle=True for training
* shuffle=False for validation/testing
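
The rule above can be checked with a short sketch. The helper `epoch_order` is hypothetical, and the seeded `generator` argument is an addition not covered in this notebook, used here only to make the shuffled order repeatable between runs. Shuffling changes the order in which samples are visited, never which samples are visited.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ids = torch.arange(8)  # sample ids 0..7 (toy data, assumed for illustration)
data = TensorDataset(ids)

# Seeded generator: an optional extra so the shuffle is repeatable
g = torch.Generator().manual_seed(0)
train_loader = DataLoader(data, batch_size=4, shuffle=True, generator=g)
val_loader = DataLoader(data, batch_size=4, shuffle=False)

def epoch_order(loader):
    # Hypothetical helper: flatten one full pass over the loader into a list
    return torch.cat([batch for (batch,) in loader]).tolist()

train_epoch_1 = epoch_order(train_loader)
train_epoch_2 = epoch_order(train_loader)  # a new permutation is drawn
val_epoch_1 = epoch_order(val_loader)
val_epoch_2 = epoch_order(val_loader)

print("train:", train_epoch_1, train_epoch_2)
print("val:  ", val_epoch_1, val_epoch_2)
```

The validation order is identical on every pass, which keeps evaluation results comparable from epoch to epoch.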

---

## HANDS-ON PRACTICE

1. Create a Dataset with 10 samples.
2. Use DataLoader with batch_size=3 and observe batches.
3. Toggle shuffle and observe order changes.
4. Explain why Dataset should not know about batching.

---

## METHODS RECAP (ONE PLACE)

Dataset, DataLoader, batch_size, shuffle

---

## ONE-SENTENCE SUMMARY

Dataset defines how to get one sample; DataLoader defines how to batch, shuffle, and iterate over those samples.

---

Dataset with 10 samples:

```python
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10)  # samples 0..9

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

dataset = SimpleDataset()
```

DataLoader with default settings (batch_size=1, shuffle=False):

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset)
```

DataLoader with batch_size=3:

```python
batch_loader = DataLoader(dataset, batch_size=3)
# Each step returns up to 3 items; the final batch holds the remainder
```

DataLoader with shuffling:

```python
shuffled_loader = DataLoader(dataset, batch_size=2, shuffle=True)
# Data order is re-randomized at the start of every epoch
```

Iterating over the batches:

```python
for batch in batch_loader:
    print(f"Batch: {batch}")
```

Output:

```
Batch: tensor([0, 1, 2])
Batch: tensor([3, 4, 5])
Batch: tensor([6, 7, 8])
Batch: tensor([9])
```
