
🔥 What Are Dataset and DataLoader in PyTorch?

- In PyTorch, data handling is divided into two major components:

  - Dataset: represents and manages your data (how to access a single sample).

  - DataLoader: efficiently loads the dataset in batches and prepares it for training.

They work together to feed data into your model during training and testing.


Splitting data handling into Dataset and DataLoader keeps your code modular: the code that reads a single sample stays separate from the code that batches, shuffles, and parallelizes loading.

Here is the breakdown of how they work, followed by a complete, runnable code example.

1. The Dataset (torch.utils.data.Dataset)
- The Dataset acts as the storage unit. Its only job is to know how to read your specific data (images, text, CSV rows) and how big the dataset is. It abstracts away the details of where the data is stored.

  - To create a custom (map-style) Dataset, you subclass Dataset and typically implement three methods:

    - __init__: Initializes the data (e.g., reads a CSV file, sets up file paths).

    - __len__: Returns the total number of samples.

    - __getitem__: Returns one specific sample (and its label) at a given index.

- Note: The Dataset class does not worry about batching or shuffling. It simply says, "You asked for item #5? Here is item #5."




2. The DataLoader (torch.utils.data.DataLoader)
- The DataLoader acts as the delivery system. It wraps around your Dataset and handles the logistics of feeding data to your model during training.

- It provides essential features automatically:

  - Batching: It stacks individual samples from the Dataset into batches (e.g., if batch_size=32, it calls __getitem__ 32 times and stacks the results).

  - Shuffling: With shuffle=True, it draws a new random order of samples every epoch, so the model does not see the data in the same sequence each time.

  - Multiprocessing: It can load data in parallel with worker processes (set via num_workers) to speed up training.
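
To make the batching step concrete, here is a minimal sketch that rebuilds one batch by hand and compares it with what the DataLoader returns. The toy tensors and the use of TensorDataset are assumptions chosen for illustration, and shuffle=False is set only so the comparison is deterministic.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 6 samples with 2 features each, plus integer labels.
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])
dataset = TensorDataset(features, labels)

# shuffle=False keeps the order fixed so the comparison below is deterministic.
loader = DataLoader(dataset, batch_size=3, shuffle=False)
first_inputs, first_labels = next(iter(loader))

# Build the same batch by hand: index the dataset three times, then stack.
samples = [dataset[i] for i in range(3)]
manual_inputs = torch.stack([x for x, _ in samples])
manual_labels = torch.stack([y for _, y in samples])

print(first_inputs.shape)                        # torch.Size([3, 2])
print(torch.equal(first_inputs, manual_inputs))  # True
print(torch.equal(first_labels, manual_labels))  # True
```

With the default settings, the stacking adds a new leading batch dimension, which is why per-sample tensors of shape (2,) become a batch of shape (3, 2).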

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader

# --- STEP 1: Define the Custom Dataset ---
class NumberDataset(Dataset):
    def __init__(self, start, end):
        # Initialize data: a list of numbers from start up to (but not including) end
        self.data = list(range(start, end))

    def __len__(self):
        # Return the total number of samples
        return len(self.data)

    def __getitem__(self, index):
        # 1. Get the sample at the index
        sample = self.data[index]

        # 2. Generate a dummy label (1 if even, 0 if odd)
        label = 1 if sample % 2 == 0 else 0

        # 3. Return as tensors (Floating point for input, Long for class labels)
        # We wrap sample in a list [] to give it a shape of (1,)
        return torch.tensor([sample],dtype=torch.float32),torch.tensor(label, dtype=torch.long)


# --- STEP 2: Instantiate the Dataset ---
# Create a dataset with numbers 0 to 19 (20 items total)
my_dataset = NumberDataset(start=0,end=20)

# Check one item:
print(f"Item of index 6: {my_dataset[6]}")

# --- STEP 3: Instantiate the DataLoader ---
# Batch size of 4: The loader will return chunks of 4 items at a time
my_DataLoader = DataLoader(
    my_dataset,
    batch_size=4,
    shuffle=True
    )

# --- STEP 4: Iterate (Simulating a training loop) ---
print("\nStarting Training Loop:\n")

for batch_idx, (inputs, labels) in enumerate(my_DataLoader):
    # inputs shape will be [4, 1] because batch_size=4
    # labels shape will be [4]
    print(f"Batch Index: {batch_idx}")
    print(f"Inputs Shape: {inputs.shape}")
    print(f"Labels: {labels.tolist()}")
    print("--"*20)


Item of index 6: (tensor([6.]), tensor(1))

Starting Training Loop:

Batch Index: 0
Inputs Shape: torch.Size([4, 1])
Labels: [0, 0, 0, 0]
----------------------------------------
Batch Index: 1
Inputs Shape: torch.Size([4, 1])
Labels: [1, 1, 0, 1]
----------------------------------------
Batch Index: 2
Inputs Shape: torch.Size([4, 1])
Labels: [1, 0, 0, 1]
----------------------------------------
Batch Index: 3
Inputs Shape: torch.Size([4, 1])
Labels: [0, 1, 0, 1]
----------------------------------------
Batch Index: 4
Inputs Shape: torch.Size([4, 1])
Labels: [1, 0, 1, 1]
----------------------------------------


- What happens in the code above?
  - Dataset: We created NumberDataset. When indexed, it turns a raw integer into an (input, label) pair of PyTorch tensors.

  - DataLoader: We set batch_size=4 and shuffle=True.

  - Iteration: Instead of getting numbers 0, 1, 2, 3 sequentially, the DataLoader might give us a batch like [12, 3, 19, 8]. It handles the grouping and tensor stacking automatically.
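
The shuffling behavior described above can be checked with a small sketch. It assumes a stand-in dataset (the integers 0 to 19 wrapped in a TensorDataset) and a hypothetical helper, epoch_order, that is not part of the notebook. Passing a seeded torch.Generator to the DataLoader is one way to make the shuffled order repeatable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the dataset above: the integers 0..19.
dataset = TensorDataset(torch.arange(20))

def epoch_order(seed):
    # A freshly seeded generator makes the shuffled order repeatable.
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)
    # Each batch holds a single tensor because the dataset wraps one tensor.
    return [batch[0].tolist() for batch in loader]

run_a = epoch_order(seed=0)
run_b = epoch_order(seed=0)

print(run_a[0])        # first batch of four shuffled values
print(run_a == run_b)  # same seed, same order

# Shuffling only reorders: every sample still appears exactly once per epoch.
flat = sorted(value for batch in run_a for value in batch)
print(flat == list(range(20)))
```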

## Another example

In [4]:
from sklearn.datasets import make_classification
import torch

In [6]:
# Step 1: Create a synthetic classification dataset using sklearn
x,y = make_classification(
    n_samples = 10, # Number of samples
    n_features=2, # Number of features
    n_informative = 2, # Number of informative features
    n_redundant=0, # Number of redundant features
    n_classes=2,# Number of classes
    random_state=42 # for reproducibility
)

In [7]:
x

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [8]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [9]:
x.shape

(10, 2)

In [10]:
y.shape

(10,)

In [12]:
x.dtype

dtype('float64')

In [13]:
# Convert the data to PyTorch tensors
x = torch.tensor(x,dtype = torch.float32)
y = torch.tensor(y,dtype = torch.long)

In [14]:
x

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [15]:
y

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [16]:
x.dtype

torch.float32

In [None]:
from torch.utils.data import Dataset, DataLoader

In [17]:
class MyCustomDataset(Dataset):
    def __init__(self,features, labels):
        self.features = features
        self.labels = labels
    def __len__(self):
        return self.features.shape[0]
    def __getitem__(self, index):
        return self.features[index],self.labels[index]


In [18]:
dataset = MyCustomDataset(x,y)

In [21]:
len(dataset)

10

In [22]:
dataset[3]

(tensor([-0.7206, -0.9606]), tensor(0))

In [23]:
dataloader = DataLoader(dataset, batch_size = 2,shuffle=True)

In [24]:
for batch_idx, (batch_features,batch_labels) in enumerate(dataloader):
    print(f"Batch Index: {batch_idx}")
    print(f"Batch Features Shape: {batch_features.shape}")
    print(f"Batch Labels Shape: {batch_labels.shape}")
    print("--"*20)

Batch Index: 0
Batch Features Shape: torch.Size([2, 2])
Batch Labels Shape: torch.Size([2])
----------------------------------------
Batch Index: 1
Batch Features Shape: torch.Size([2, 2])
Batch Labels Shape: torch.Size([2])
----------------------------------------
Batch Index: 2
Batch Features Shape: torch.Size([2, 2])
Batch Labels Shape: torch.Size([2])
----------------------------------------
Batch Index: 3
Batch Features Shape: torch.Size([2, 2])
Batch Labels Shape: torch.Size([2])
----------------------------------------
Batch Index: 4
Batch Features Shape: torch.Size([2, 2])
Batch Labels Shape: torch.Size([2])
----------------------------------------
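
The loop above only prints shapes. As a rough sketch of how such batches would typically be consumed during training, the example below feeds them to a tiny linear classifier. The data, the labeling rule, the model, the learning rate, and the number of epochs are all assumptions made for illustration and are not part of the notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data with the same shapes as above: 10 samples, 2 features.
features = torch.randn(10, 2)
labels = (features[:, 0] > 0).long()  # made-up rule, just to get binary labels
loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=True)

model = nn.Linear(2, 2)          # 2 input features -> 2 class logits
loss_fn = nn.CrossEntropyLoss()  # expects float logits and long class labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running = 0.0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()                 # clear gradients from the last step
        logits = model(batch_features)        # shape [2, 2]: one row per sample
        loss = loss_fn(logits, batch_labels)
        loss.backward()                       # compute gradients
        optimizer.step()                      # update the weights
        running += loss.item()
    # len(loader) is the number of batches per epoch (10 samples / 2 = 5).
    epoch_losses.append(running / len(loader))
    print(f"Epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

This is also why the features were converted to float32 and the labels to long earlier: those are the dtypes that nn.Linear and nn.CrossEntropyLoss expect for inputs and class-index targets.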
