# Data Management 

Real-world datasets are messy; they rarely arrive as clean and uniformly formatted as MNIST.

**Case Study: Oxford Flowers Dataset**

We have a folder full of flower images with names like `image_00001.jpg`. The labels are stored separately in a `.mat` file.

**Problems in Data Preparation:**
* Access Problems
* Quality Problems
* Efficiency Problems

## Data Pipelines

`Data Ingestion -> Data Prep -> Modeling -> Training -> Evaluation -> Deployment`


In [1]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

## Data Access

### PyTorch Dataset Class

In [None]:
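import os
from PIL import Image
from scipy.io import loadmat

class OxfordFlowersDataset(Dataset):

    # Set up where to find images and labels
    def __init__(self, root_dir, transform=None):
        # Assumed layout: root_dir/jpg/image_00001.jpg ... and root_dir/imagelabels.mat
        self.img_dir = os.path.join(root_dir, "jpg")
        self.transform = transform
        # Lazy loading: only the labels are read here; images are opened on demand
        mat = loadmat(os.path.join(root_dir, "imagelabels.mat"))
        # Assumed key "labels" with 1-based class ids; shift to 0-based
        self.labels = mat["labels"].squeeze().astype(int) - 1

    # How many total samples
    def __len__(self):
        return len(self.labels)

    # How to get image and label number 'idx'
    def __getitem__(self, idx):
        # File names are 1-based: idx 0 -> image_00001.jpg
        path = os.path.join(self.img_dir, f"image_{idx + 1:05d}.jpg")
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, int(self.labels[idx])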


### Transformation Pipeline

This step addresses the quality problems:

* Images can come in different sizes and formats.
* We need to handle resizing, format conversion, and normalization.
* All images in a batch must have the same size so that PyTorch can stack them into a single tensor.
* Batches follow the layout `[batch_size, channels, height, width]`.

In [None]:
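from torchvision import transforms

# Changing image sizes:
# resize the shorter edge to 256, then crop a 224x224 square from the center
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
])

# Convert the PIL image to a tensor (`img` is a PIL image loaded beforehand).
# ToTensor also scales pixel values from [0, 255] to [0.0, 1.0] and
# stores them as [channels, height, width] (3 channels for an RGB image)
img_tensor = transforms.ToTensor()(img)

# Normalization: subtract a per-channel mean and divide by a per-channel std,
# so each channel ends up roughly centered at 0 with unit spread.
# The ImageNet statistics below are an assumed, commonly used default.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
img_tensor = normalize(img_tensor)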



Always inspect the transformed images (shapes, value ranges, and a few visual samples) to confirm that every step of the pipeline behaves as expected.

## DataLoader

1. Split the dataset into train, validation, and test sets.
2. Use `DataLoader` to batch and serve that data efficiently.


In [None]:
from torch.utils.data import random_split

# split into train/val/test

train_size = int(0.7*len(dataset))
val_size = int(0.15*len(dataset))
test_size = len(dataset) - train_size - val_size

train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])
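By default `random_split` produces a different split on every run, which makes results hard to compare. One option is to pass a seeded `torch.Generator`. The sketch below uses a synthetic `TensorDataset` as a stand-in for the real dataset; the helper `split` and the seed value 42 are arbitrary, hypothetical choices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for the real dataset: 100 samples, 5 classes
dataset = TensorDataset(torch.rand(100, 3, 8, 8), torch.arange(100) % 5)

train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size  # the remainder absorbs rounding


def split(seed):
    """Split the dataset with a seeded generator so the result is repeatable."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size, test_size], generator=generator)


train_dataset, val_dataset, test_dataset = split(seed=42)
train_again, _, _ = split(seed=42)

# Each split is a Subset that stores the indices it selected
all_indices = (
    list(train_dataset.indices) + list(val_dataset.indices) + list(test_dataset.indices)
)

print("sizes:", len(train_dataset), len(val_dataset), len(test_dataset))
print("same split with same seed:", list(train_dataset.indices) == list(train_again.indices))
print("every sample used exactly once:", sorted(all_indices) == list(range(len(dataset))))
```

Computing `test_size` as the remainder guarantees the three sizes always add up to `len(dataset)`, even when the fractions do not divide evenly.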


In [None]:
from torch.utils.data import DataLoader

#Create DataLoaders for each set

train_loader = DataLoader(train_dataset, batch_size=32)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)


In [None]:
# loop to go through all the data, one batch at a time

for images, labels in train_loader:
    ...

# Get just first batch to inspect. Quick debugging
images, labels = next(iter(train_loader))
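Inspecting the first batch is mostly about checking shapes. The following sketch uses random tensors in place of real images (the sizes and the 5 classes are assumptions chosen to keep it fast) and shows the `[batch_size, channels, height, width]` layout, as well as the smaller final batch that appears when the dataset size is not a multiple of `batch_size`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 70 small RGB "images" with labels from 5 classes
images = torch.rand(70, 3, 32, 32)
labels = torch.randint(0, 5, (70,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32)

# Grab just the first batch for a quick look
batch_images, batch_labels = next(iter(train_loader))
print("images:", tuple(batch_images.shape))   # [batch_size, channels, height, width]
print("labels:", tuple(batch_labels.shape))   # [batch_size]

# Walk through the whole loader and record how many samples each batch holds
batch_sizes = [x.shape[0] for x, _ in train_loader]
print("batch sizes:", batch_sizes)
```

If the short last batch is a problem (for example with batch normalization), `DataLoader` accepts `drop_last=True` to discard it.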

In [None]:
# Shuffle training data so each epoch sees the samples in a new order

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# For val and test data, shuffling is not necessary
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
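To see what `shuffle` actually changes, it helps to record the order in which samples are served. In this sketch the dataset is just the integers 0 to 19, so each sample is its own index; the helper `epoch_order` and the seed are hypothetical and exist only for this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Each sample is simply its own index, which makes the serving order visible
dataset = TensorDataset(torch.arange(20))


def epoch_order(loader):
    """Run one full pass over the loader and return the sample order."""
    return [int(i) for (batch,) in loader for i in batch]


shuffled_loader = DataLoader(
    dataset, batch_size=4, shuffle=True,
    generator=torch.Generator().manual_seed(0),  # seeded only to make the demo repeatable
)
ordered_loader = DataLoader(dataset, batch_size=4, shuffle=False)

first_epoch = epoch_order(shuffled_loader)
second_epoch = epoch_order(shuffled_loader)
eval_order = epoch_order(ordered_loader)

print("epoch 1:", first_epoch)
print("epoch 2:", second_epoch)
print("eval   :", eval_order)
```

With `shuffle=True` the loader draws a fresh permutation at the start of every epoch, while each sample is still visited exactly once per epoch. With `shuffle=False` the order is always the dataset order, which keeps evaluation results easy to reproduce.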

## Bugproof Pipelines

Steps to make the pipeline more reliable before training:

1. Data Augmentation - apply random transformations on the fly during training for better generalization

2. Corrupted Images - instead of crashing, log the error and skip to the next image

3. Overly Aggressive Augmentation - keep augmentation reasonable, so the transformed image is still recognizable

4. Data Tracking - record which images get loaded, how often each one is accessed, and how long each load takes

Common problems that tracking helps to reveal:

* Shuffling bugs
* Performance issues
* Data imbalance
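Skipping corrupted images (item 2) and data tracking (item 4) can be sketched together as a wrapper around any dataset. Everything named here is hypothetical: `FlakyToyDataset` simulates corrupted files by raising for some indices, `RobustTrackedDataset` is one possible wrapper, and the logger name is arbitrary.

```python
import logging
import time
from collections import Counter

import torch
from torch.utils.data import Dataset

logger = logging.getLogger("data_pipeline")  # hypothetical logger name


class FlakyToyDataset(Dataset):
    """Stand-in dataset: indices listed in `bad` raise, simulating corrupted images."""

    def __init__(self, n=10, bad=(3, 7)):
        self.n = n
        self.bad = set(bad)

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if idx in self.bad:
            raise OSError(f"cannot decode sample {idx}")
        return torch.full((3, 4, 4), float(idx)), idx % 2


class RobustTrackedDataset(Dataset):
    """Wraps a dataset: skips samples that fail to load and records access stats."""

    def __init__(self, base):
        self.base = base
        self.access_counts = Counter()  # how often each index was actually served
        self.load_times = []            # seconds per successful load
        self.failed = []                # indices that raised while loading

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        # Try idx first, then the following indices (wrapping around),
        # giving up only after one full pass over the dataset
        for offset in range(len(self.base)):
            candidate = (idx + offset) % len(self.base)
            start = time.perf_counter()
            try:
                sample = self.base[candidate]
            except Exception as err:
                logger.warning("Skipping sample %d: %s", candidate, err)
                self.failed.append(candidate)
                continue
            self.load_times.append(time.perf_counter() - start)
            self.access_counts[candidate] += 1
            return sample
        raise RuntimeError("No loadable sample found in the dataset")


dataset = RobustTrackedDataset(FlakyToyDataset())
samples = [dataset[i] for i in range(len(dataset))]

print("failed indices:", dataset.failed)
print("access counts:", dict(dataset.access_counts))
print("slowest load (s):", max(dataset.load_times))
```

The access counts make the side effect of skipping visible: the neighbor of each corrupted sample is served twice, which can quietly skew the class balance if many files are broken. With `num_workers > 0` each worker process holds its own copy of the wrapper, so these counters would need to be aggregated separately.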

In [None]:
# on the fly augmentation

