# Dataset and Dataloader
In this section we'll examine Dataset and DataLoader: how they work and why we use them.

<b>Dataset</b>: manages the data and its labels. It is also the place for preprocessing, data augmentation, etc. We will subclass torch.utils.data.Dataset to build our custom datasets.<br>
<b>DataLoader</b>: loads the data in batches and feeds them to the model that we will create in the next section. For this we will use the torch.utils.data.DataLoader class.

In [34]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import numpy as np 
import pandas as pd

### Dataset

We create a custom dataset class by inheriting from torch.utils.data.Dataset and implementing three methods:<br>
`__init__`: sets up the class attributes, such as the data and the labels (the usual constructor code)<br>
`__getitem__`: returns the sample (data and label) at the given index<br>
`__len__`: returns the length of the dataset (how many samples it contains)<br>

In [35]:
class SimpleDatasetForBinary(Dataset):
    def __init__(self):
        self.data = torch.arange(10)  # sample data: the integers 0..9
        self.label = self.data % 2    # labels: 0, 1, 0, 1, ...
    
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data =  self.data[idx]
        label = self.label[idx]
        return data, label

In [36]:
simple_dataset = SimpleDatasetForBinary()
print("length of the dataset : ", len(simple_dataset))

length of the dataset :  10


In [37]:
for i in range(len(simple_dataset)):
    x, y = simple_dataset[i]
    print(f"Index {i}: Data={x.item()}  Label={y.item()}")

Index 0: Data=0  Label=0
Index 1: Data=1  Label=1
Index 2: Data=2  Label=0
Index 3: Data=3  Label=1
Index 4: Data=4  Label=0
Index 5: Data=5  Label=1
Index 6: Data=6  Label=0
Index 7: Data=7  Label=1
Index 8: Data=8  Label=0
Index 9: Data=9  Label=1


This class is not limited to tabular data; we can also use it for image data and labels.

In [38]:
class DummyImageDataset(Dataset):
    def __init__(self, transforms= None):
        self.imageData = np.random.randint(0, 256, (10, 64, 64, 3), dtype=np.uint8)  # create 10 random 64x64 RGB images with NumPy
        self.imageLabel = np.random.randint(0, 2, (10,), dtype=np.int64)
        self.transforms = transforms

    def __len__(self):
        return len(self.imageData)
    
    def __getitem__(self, idx):
        img = self.imageData[idx]
        label = self.imageLabel[idx]
        if self.transforms:
            img  = self.transforms(img)
        return img, label
    
transform = transforms.Compose([
    transforms.ToTensor(),  # numpy → Tensor, and [H,W,C] → [C,H,W]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

In [39]:
image_dataset = DummyImageDataset(transforms=transform)
img, label = image_dataset[1]
print("Image Type:", type(img))
print("Image Shape:", img.shape)   # [3, 64, 64]
print("Label:", label)

Image Type: <class 'torch.Tensor'>
Image Shape: torch.Size([3, 64, 64])
Label: 0
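To make the effect of the two transforms concrete, the sketch below reproduces their arithmetic by hand with plain torch operations instead of calling torchvision. It assumes the documented behaviour of `ToTensor` (a uint8 `[H, W, C]` array becomes a float `[C, H, W]` tensor scaled to `[0, 1]`) and of `Normalize` (`(x - mean) / std` per channel). The helper name `manual_to_tensor_and_normalize` is made up for this illustration.

```python
import numpy as np
import torch

def manual_to_tensor_and_normalize(img_hwc, mean=0.5, std=0.5):
    """Hand-written stand-in for ToTensor() followed by Normalize() (illustrative only)."""
    x = torch.from_numpy(img_hwc).permute(2, 0, 1)  # [H, W, C] -> [C, H, W]
    x = x.float() / 255.0                            # uint8 0..255 -> float 0..1
    return (x - mean) / std                          # 0..1 -> -1..1 when mean = std = 0.5

# A tiny 2x2 RGB image that contains the extreme pixel values 0 and 255.
img = np.array([[[0, 0, 0], [255, 255, 255]],
                [[255, 0, 255], [0, 255, 0]]], dtype=np.uint8)

out = manual_to_tensor_and_normalize(img)
print(out.shape)                            # torch.Size([3, 2, 2])
print(out.min().item(), out.max().item())   # -1.0 1.0
```

With mean and std both set to 0.5, pixel values end up in the range -1 to 1, which is why the images returned by the dataset above are float tensors rather than uint8 arrays.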


Next, let's read data from a CSV file. We first create a small sample file and then read it back.

In [40]:
df = pd.DataFrame({
    "feature1": [1.2, 2.3, 3.1, 4.7, 5.5, 6.0, 7.4, 8.1, 9.2, 10.5],
    "feature2": [3.3, 1.8, 2.9, 4.1, 5.2, 6.3, 7.8, 8.6, 9.9, 10.1],
    "target":   [0,   1,   0,   1,   1,   0,   1,   0,   1,   0]
})
df.to_csv("sample_data.csv", index=False) #Creating and saving sample csv data 

In [41]:
class csvDataset(Dataset):
    def __init__(self, csvPath):
        self.df = pd.read_csv(csvPath)

    def __len__(self):
        return len(self.df)
    
    def __getitem__(self,idx):
        x = self.df.loc[idx, ["feature1", "feature2"]].values.astype(float)
        y = self.df.loc[idx, "target"].astype(int)

        x = torch.tensor(x, dtype=torch.float32)
        y = torch.tensor(y, dtype= torch.long)

        return x,y

In [42]:
csv_dataset = csvDataset('sample_data.csv')
print("Length of dataset : ",len(csv_dataset))
for i in range(len(csv_dataset)):
    x, y = csv_dataset[i]
    print(f"index({i}) >> data : ", x, ", label : ",y)

Length of dataset :  10
index(0) >> data :  tensor([1.2000, 3.3000]) , label :  tensor(0)
index(1) >> data :  tensor([2.3000, 1.8000]) , label :  tensor(1)
index(2) >> data :  tensor([3.1000, 2.9000]) , label :  tensor(0)
index(3) >> data :  tensor([4.7000, 4.1000]) , label :  tensor(1)
index(4) >> data :  tensor([5.5000, 5.2000]) , label :  tensor(1)
index(5) >> data :  tensor([6.0000, 6.3000]) , label :  tensor(0)
index(6) >> data :  tensor([7.4000, 7.8000]) , label :  tensor(1)
index(7) >> data :  tensor([8.1000, 8.6000]) , label :  tensor(0)
index(8) >> data :  tensor([9.2000, 9.9000]) , label :  tensor(1)
index(9) >> data :  tensor([10.5000, 10.1000]) , label :  tensor(0)
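Looking up one row with `df.loc` on every `__getitem__` call is fine for ten rows, but it can become slow on large tables. One possible variation, sketched below, converts the columns to tensors once in `__init__` so that indexing is a plain tensor lookup. The class name `CsvTensorDataset` and the in-memory CSV text are assumptions made for this example; the column names follow the sample file above.

```python
import io
import pandas as pd
import torch
from torch.utils.data import Dataset

class CsvTensorDataset(Dataset):
    """Hypothetical variant of csvDataset that converts the table to tensors once."""
    def __init__(self, csv_source):
        df = pd.read_csv(csv_source)
        # Convert the feature and target columns up front instead of per sample.
        self.x = torch.tensor(df[["feature1", "feature2"]].to_numpy(), dtype=torch.float32)
        self.y = torch.tensor(df["target"].to_numpy(), dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# Three rows of in-memory CSV text stand in for a file on disk.
csv_text = "feature1,feature2,target\n1.2,3.3,0\n2.3,1.8,1\n3.1,2.9,0\n"
tensor_dataset = CsvTensorDataset(io.StringIO(csv_text))
x, y = tensor_dataset[1]
print(len(tensor_dataset), x, y)
```

The trade-off is that the whole table is held in memory as tensors, so this fits small and medium tables rather than files that are larger than RAM.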


As you can see, a Dataset manages the data source: it imports the data and applies transforms, preprocessing, and related operations. It is an essential topic in PyTorch. Now we will look at the second part of this section, the DataLoader.

### DataLoader
<ul>
<li>DataLoader is a tool used to feed your model with data, together with PyTorch's custom data class (Dataset).</li>
<li>It automatically batches the data from your Dataset, can shuffle the data, and, if you want, loads data in parallel using multiple CPU worker processes (the num_workers argument).</li>
<li>It is used for efficient training, easy iteration, and handling large datasets.</li>
</ul>

In [43]:
dataloader = DataLoader(
    csv_dataset,     # the custom Dataset instance we created above
    batch_size=3,    # number of samples in each batch
    shuffle=True,    # reshuffle the data at every epoch
    num_workers=0    # 0 = load data in the main process (a safe default, e.g. on Windows)
)

for batch_x, batch_y in dataloader:
    print("Batch data:", batch_x)
    print("Batch labels:", batch_y)

Batch data: tensor([[7.4000, 7.8000],
        [6.0000, 6.3000],
        [5.5000, 5.2000]])
Batch labels: tensor([1, 0, 1])
Batch data: tensor([[2.3000, 1.8000],
        [9.2000, 9.9000],
        [1.2000, 3.3000]])
Batch labels: tensor([1, 1, 0])
Batch data: tensor([[3.1000, 2.9000],
        [8.1000, 8.6000],
        [4.7000, 4.1000]])
Batch labels: tensor([0, 0, 1])
Batch data: tensor([[10.5000, 10.1000]])
Batch labels: tensor([0])
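The last batch above holds a single sample because 10 is not divisible by 3. The short sketch below counts the batches with and without `drop_last=True`, a `DataLoader` option that discards such an incomplete final batch. It uses `TensorDataset` with made-up data as a stand-in for the CSV dataset.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 samples with 2 features each, plus binary labels.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10) % 2
dataset = TensorDataset(features, labels)

keep_last = DataLoader(dataset, batch_size=3, shuffle=True)
drop_last = DataLoader(dataset, batch_size=3, shuffle=True, drop_last=True)

print(len(keep_last), math.ceil(10 / 3))            # 4 4
print(len(drop_last), 10 // 3)                      # 3 3
print([len(batch_y) for _, batch_y in keep_last])   # [3, 3, 3, 1]
```

Dropping the last batch can be useful when every batch must have the same size, at the cost of skipping a few samples in each epoch.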


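With the help of the DataLoader, the model receives the data one batch at a time. If the Dataset reads each sample from disk inside `__getitem__`, we don't need to load the whole dataset into RAM at once. (The small example datasets above still keep all of their data in memory.)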
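Whether the data stays out of RAM depends on how the Dataset is written. As a minimal sketch of the idea, the hypothetical `LazySquaresDataset` below creates each sample only when `__getitem__` is called and records which indices were requested, so you can see that the DataLoader pulls samples one batch at a time. Generating numbers on the fly stands in for reading a file from disk.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LazySquaresDataset(Dataset):
    """Hypothetical dataset that builds each sample on demand."""
    def __init__(self, size):
        self.size = size
        self.requested = []   # indices fetched so far (kept only for demonstration)

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        self.requested.append(idx)
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx) ** 2])
        return x, y

lazy_dataset = LazySquaresDataset(1000)
loader = DataLoader(lazy_dataset, batch_size=4, shuffle=False, num_workers=0)

# Fetch only the first batch; the other 996 samples are never created.
first_x, first_y = next(iter(loader))
print(first_x.shape)            # torch.Size([4, 1])
print(lazy_dataset.requested)   # [0, 1, 2, 3]
```

The same pattern applies to image folders: store only the file paths in `__init__` and open the image inside `__getitem__`.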