# Datasets and Dataloaders in PyTorch

In [139]:
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
import torch
import os
from PIL import Image
import numpy as np

This example uses the Cats vs Dogs dataset. Download it from https://www.microsoft.com/en-us/download/details.aspx?id=54765
Once downloaded, extract the zip file to a directory of your choice, then set the dataDir variable to the PetImages folder inside that directory.

In [15]:
# Raw string so that the Windows backslashes are not treated as escape sequences
dataDir = r"C:\Users\sunny\Downloads\kagglecatsanddogs_3367a\PetImages"
folder = os.listdir(dataDir)

## Method 1 - Use of built-in methods

First, create a dataset with torchvision's built-in ImageFolder, which infers the labels from the directory structure: each subfolder (here Cat and Dog) becomes one class.

In [162]:
myTransform = transforms.Compose([transforms.Resize([224, 224]), transforms.ToTensor()])
data = datasets.ImageFolder(dataDir, transform=myTransform)

In [163]:
data

Dataset ImageFolder
    Number of datapoints: 25000
    Root location: C:\Users\sunny\Downloads\kagglecatsanddogs_3367a\PetImages
    StandardTransform
Transform: Compose(
               Resize(size=[224, 224], interpolation=bilinear, max_size=None, antialias=None)
               ToTensor()
           )
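
To make the label assignment concrete, here is a rough sketch of the idea behind it: subfolder names are sorted and numbered from zero. The helper `find_classes` below is a hypothetical illustration written for this notebook, not torchvision's actual implementation, and the temporary directory only mimics the PetImages layout.

```python
import os
import tempfile

def find_classes(root):
    """Map each subdirectory name under root to an integer label, in sorted order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}

# Build a throwaway directory that mimics the PetImages layout (assumed: Cat/ and Dog/).
with tempfile.TemporaryDirectory() as root:
    for name in ("Dog", "Cat"):
        os.makedirs(os.path.join(root, name))
    class_to_idx = find_classes(root)

print(class_to_idx)  # {'Cat': 0, 'Dog': 1}
```

On the real dataset, the mapping that ImageFolder chose can be inspected with `data.class_to_idx`.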

DataLoader wraps a dataset and takes care of batching and shuffling. Create a DataLoader object from the dataset that you just created.

In [164]:
dataLoader = DataLoader(data,batch_size=10,shuffle=True)

Test the dataloader by fetching one batch: a holds the images and b holds the labels.

In [165]:
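# Use the built-in next(); the iterator's .next() method is not available in recent PyTorch versions
a, b = next(iter(dataLoader))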
b

tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1])
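
The batching and shuffling behavior is easier to see on a small dataset. The following is a minimal sketch on a toy `TensorDataset`; the names `toy`, `ordered`, and `shuffled` are made up for this illustration and are not part of the notebook above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the image dataset: 10 samples with 3 features each,
# labelled 0..9 so that the ordering is easy to follow.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy = TensorDataset(features, labels)

# Without shuffling, batches follow dataset order; the last batch is smaller.
ordered = DataLoader(toy, batch_size=4, shuffle=False)
batch_sizes = [len(y) for _, y in ordered]
print(batch_sizes)  # [4, 4, 2]

# With shuffling, each pass still visits every sample exactly once, in random order.
shuffled = DataLoader(toy, batch_size=4, shuffle=True)
seen = torch.cat([y for _, y in shuffled])
print(sorted(seen.tolist()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If an incomplete final batch is a problem, passing `drop_last=True` to DataLoader discards it.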

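## Method 2 - Custom Dataset object

Alternatively, subclass Dataset and implement __len__ and __getitem__ yourself. This gives full control over how each sample is loaded and labelled.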

In [125]:
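class myDataset(Dataset):
    def __init__(self, dataDir):
        # os.path.join avoids hand-written backslashes in the paths
        self.dogList = os.listdir(os.path.join(dataDir, 'Dog'))
        self.catList = os.listdir(os.path.join(dataDir, 'Cat'))
        self.dir = dataDir

    def __len__(self):
        return len(self.catList) + len(self.dogList)

    def __getitem__(self, index):
        # Indices [0, len(dogList)) map to dogs (label 1); the rest map to cats (label 0)
        if index < len(self.dogList):
            path = os.path.join(self.dir, 'Dog', self.dogList[index])
            label = 1
        else:
            path = os.path.join(self.dir, 'Cat', self.catList[index - len(self.dogList)])
            label = 0
        # Force 3 channels and resample to 224x224 with PIL;
        # np.resize would only repeat or truncate the raw pixel data
        img = Image.open(path).convert('RGB').resize((224, 224))
        # np.array makes a writable copy; the label is a local, not shared state
        return np.array(img), label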
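
When resizing inside `__getitem__`, it matters which resize function is used: PIL's `Image.resize` resamples the pixels, whereas `np.resize` only repeats or truncates the flattened data. The small sketch below illustrates the difference on a made-up 2x3 array; the names `small`, `tiled`, and `resampled` are hypothetical.

```python
import numpy as np
from PIL import Image

# A tiny 2x3 single-channel "image" with distinct pixel values.
small = np.arange(6, dtype=np.uint8).reshape(2, 3)

# np.resize flattens the data and repeats it until the new shape is full,
# so the result is not a scaled version of the image.
tiled = np.resize(small, (3, 4))
print(tiled)
# [[0 1 2 3]
#  [4 5 0 1]
#  [2 3 4 5]]

# PIL resamples the pixels instead; note that size is given as (width, height).
resampled = np.asarray(Image.fromarray(small).resize((4, 3)))
print(resampled.shape)  # (3, 4)
```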

In [126]:
catvdog = myDataset(dataDir)

In [127]:
dataLoader3 = DataLoader(catvdog, batch_size=10, shuffle=True)

In [132]:
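# Use the built-in next(); the iterator's .next() method is not available in recent PyTorch versions
a, b = next(iter(dataLoader3))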
b

tensor([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])
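
Note that the batches from the two methods differ in layout. ToTensor yields float images shaped (channels, height, width), while a custom dataset that returns NumPy arrays yields uint8 images shaped (height, width, channels). The sketch below uses a hypothetical in-memory `ToyImages` class with random 32x32 images as a stand-in for myDataset, and shows one way to convert a batch, assuming the model expects the ToTensor layout.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyImages(Dataset):
    """Hypothetical stand-in that returns (H, W, C) uint8 arrays and int labels."""

    def __init__(self, n=20):
        rng = np.random.default_rng(0)
        self.images = rng.integers(0, 256, size=(n, 32, 32, 3), dtype=np.uint8)
        self.labels = [i % 2 for i in range(n)]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

loader = DataLoader(ToyImages(), batch_size=10, shuffle=True)
images, labels = next(iter(loader))

# The default collation stacks the NumPy arrays into one uint8 tensor
# and turns the Python int labels into an int64 tensor.
print(images.shape, images.dtype)  # torch.Size([10, 32, 32, 3]) torch.uint8
print(labels.dtype)                # torch.int64

# Convert to the layout produced by ToTensor: (N, C, H, W), float32 in [0, 1].
images = images.permute(0, 3, 1, 2).float() / 255.0
print(images.shape, images.dtype)  # torch.Size([10, 3, 32, 32]) torch.float32
```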