# 3 Build a custom dataloader

In [None]:
from pathlib import Path
import numpy as np
from typing import Iterator, Tuple, List
# NB: you might get a cuda warning if you don't have a GPU available.

The problem with images is that their memory footprint grows quickly with the number of images.

In [None]:
image_size = (180, 180, 3)

for i in [1, 10, 100]:
    size = (i, ) + image_size
    X = np.zeros(size)  # float64 by default: 8 bytes per value
    size_byte = X.nbytes
    print(f"Size for {i} images: {size_byte / (2**20):.2f} MB")

Imagine what would happen if you actually had a million images! And no, the answer to this
is not "just get more RAM in the cloud". You don't need to hold everything in memory at
the same time, so we will use the dataloader pattern to fix this problem.
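
To see where these numbers come from, the footprint can be computed without allocating anything. The sketch below assumes the same 180x180x3 image shape as the cell above; the helper name `batch_megabytes` is made up for illustration. It also shows how much the dtype matters: `np.zeros` uses float64 (8 bytes per value), while decoded images are typically uint8 (1 byte per value).

```python
import numpy as np

image_shape = (180, 180, 3)

def batch_megabytes(n_images: int, dtype=np.float64) -> float:
    """Memory for n_images of image_shape, in units of 2**20 bytes."""
    n_values = n_images * int(np.prod(image_shape))
    return n_values * np.dtype(dtype).itemsize / 2**20

for n in [1, 100, 1_000_000]:
    print(
        f"{n:>9} images: {batch_megabytes(n):,.1f} MB as float64, "
        f"{batch_megabytes(n, np.uint8):,.1f} MB as uint8"
    )
```

A single float64 image takes about 0.74 MB, so the total scales linearly from there.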

TensorFlow has a nice [collection of datasets](https://www.tensorflow.org/datasets) for machine learning tasks. Let's download the 'flower_photos' dataset with `mads_datasets`; we will use it for image classification later on.

In [None]:
from mads_datasets import DatasetFactoryProvider, DatasetType
flowersfactory = DatasetFactoryProvider.create_factory(DatasetType.FLOWERS)
flowersfactory.download_data()


In [None]:
image_folder = flowersfactory.subfolder
print(image_folder)

Let's build a datagenerator from scratch. Even though there are a lot of libraries (tensorflow, pytorch, trax) that provide datagenerators for images, it is useful practice to learn how they work on the inside.

Eventually you will encounter a task where you need to read in data from disk, and it is always useful to know how to adapt to a custom case. The first step is to list all files in the directory:

In [None]:
def walk_dir(path: Path) -> Iterator:
    """loops recursively through a folder

    Args:
        path (Path): folder to loop through. If a directory
            is encountered, loop through that recursively.

    Yields:
        Generator: all paths in a folder and subdirs.
    """

    for p in Path(path).iterdir():
        if p.is_dir():
            yield from walk_dir(p)
            continue
        # resolve works like .absolute(), but it removes the "../.." parts
        # of the location, so it is cleaner
        yield p.resolve()

Note that the first file is a .txt file, so we will need to filter that.

In [None]:
paths = walk_dir(image_folder)
file1 = next(paths)
file2 = next(paths)
file1, file2
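
As a side note, `pathlib` can also do the recursive listing for you with `Path.rglob`. The sketch below builds a small throwaway folder (the file and folder names are made up) and lists every file in it; writing `walk_dir` by hand is still a good way to see how the recursion works.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    # a tiny, made-up folder structure
    (root / "tulips").mkdir()
    (root / "tulips" / "a.jpg").touch()
    (root / "LICENSE.txt").touch()

    # rglob("*") yields files and directories, so keep only the files
    files = sorted(p for p in root.rglob("*") if p.is_file())
    names = [p.relative_to(root).as_posix() for p in files]
    print(names)
```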

So, we now have a generator of paths in the directory. We can use a path to load an image from disk.
A structure that is often used for storing images is to have subfolders that indicate the label.
This makes it easy for a human to create a dataset: just drag and drop the images into the right folder to label them.

If the photo is inside the `tulips` subfolder, the class label should be `tulips`.
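
With `pathlib`, the label is simply the name of the parent folder. A minimal sketch, using a hypothetical path that does not need to exist on disk:

```python
from pathlib import Path

# hypothetical path, only to illustrate the folder-as-label convention
photo = Path("flower_photos") / "tulips" / "example.jpg"
label = photo.parent.name
print(label)  # tulips
```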

In [None]:
from PIL import Image
file = next(paths)
img = Image.open(file)

In [None]:
img.show()

The `iter_valid_paths` function pulls all files, keeps only the files with the correct suffixes (we only want images), retrieves the class names by gathering the names of the subfolders, and returns both.

In [None]:
from mads_datasets.settings import FileTypes
for ft in FileTypes:
    print(ft)

In [None]:
def iter_valid_paths(path: Path, formats: List[FileTypes]) -> Tuple[Iterator, List[str]]:
    # gets all files in folder and subfolders
    walk = walk_dir(path)
    # retrieves foldernames as classnames
    class_names = [subdir.name for subdir in path.iterdir() if subdir.is_dir()]
    # keeps only specified formats
    formats_ = [f.value for f in formats]
    paths = (path for path in walk if path.suffix in formats_)
    return paths, class_names

In [None]:
formats = [FileTypes.JPG]
paths, class_names = iter_valid_paths(
    path = image_folder / "flower_photos",
    formats=formats
)

In [None]:
next(paths), class_names
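
One thing to keep in mind: `path.suffix in formats_` is an exact, case-sensitive string comparison, so if the format value is `.jpg`, a file named `photo.JPG` would be skipped. The sketch below shows a case-insensitive variant. It uses plain strings instead of `FileTypes`, and the helper `has_valid_suffix` is hypothetical, not part of `mads_datasets`.

```python
from pathlib import Path
from typing import Iterable

def has_valid_suffix(path: Path, suffixes: Iterable[str]) -> bool:
    """Compare suffixes in lower case, so .JPG and .jpg both match."""
    return path.suffix.lower() in {s.lower() for s in suffixes}

candidates = [Path("a.jpg"), Path("b.JPG"), Path("notes.txt")]
kept = [p.name for p in candidates if has_valid_suffix(p, [".jpg"])]
print(kept)  # ['a.jpg', 'b.JPG']
```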

And, last, we need the `load_image` function.

While there are multiple libraries available to load images (`pyvips`, `PIL`, `tensorflow`), the cells below use `PIL` for this sequence of tasks:
- load image from disk
- decode into an array of numbers
- resize the image to a fixed size
- cast to `numpy` array

In [None]:
imgpath = next(paths)
newsize = (150, 150)
img_ = Image.open(imgpath).resize(newsize, Image.LANCZOS)

In [None]:
img = np.asarray(img_)
img.shape

In [None]:
def load_image(
    path: Path, image_size: Tuple[int, int]
) -> np.ndarray:
    # load file
    img_ = Image.open(path).resize(image_size, Image.LANCZOS)
    return np.asarray(img_)

In [None]:
%timeit load_image(file, image_size=(180, 180))

In [None]:
file = next(paths)
img = load_image(file, (180, 180))
type(img), img.shape
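
Two details are easy to miss when going from `PIL` to `numpy`. First, `resize` takes `(width, height)`, while the resulting array is ordered `(height, width, channels)`; with a square size like `(180, 180)` you will not notice the difference. Second, a grayscale image has no channel axis unless you convert it to RGB. The sketch below uses a synthetic image instead of a file from disk, so the sizes are made up for illustration.

```python
import numpy as np
from PIL import Image

# a synthetic grayscale image, 64 pixels wide and 32 pixels high
gray = Image.new("L", (64, 32))
print(np.asarray(gray).shape)  # (32, 64): no channel axis

# convert to RGB, then resize to width=200, height=100
resized = gray.convert("RGB").resize((200, 100), Image.LANCZOS)
arr = np.asarray(resized)

# the numpy array is ordered (height, width, channels)
print(arr.shape, arr.dtype)  # (100, 200, 3) uint8
```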

We need to add a batch dimension. This is a single image, so the batch size is 1. We can build the new shape by concatenating tuples like this:

In [None]:
(1,) + img.shape

In [None]:
x = np.reshape(img, (1,) + img.shape)
x.shape
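
For reference, `numpy` offers a few equivalent ways to add the leading batch axis. The sketch below uses an array of zeros as a stand-in for a loaded image; which spelling you prefer is mostly a matter of taste.

```python
import numpy as np

img = np.zeros((180, 180, 3), dtype=np.uint8)  # stand-in for a loaded image

a = np.reshape(img, (1,) + img.shape)  # the approach used above
b = np.expand_dims(img, axis=0)        # names the axis explicitly
c = img[np.newaxis, ...]               # indexing shorthand

print(a.shape, b.shape, c.shape)
```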

Let's have a look at the image we loaded.

In [None]:
Image.fromarray(img.astype(np.uint8))

With this, we can construct our own data generator, using the design pattern we looked at in lesson 2.

- We gather all the paths to files
- We shuffle the index_list 
- For the range of `batchsize`, we use the `index_list[index]` design pattern to gather a random batch
- The label name is extracted from the subfolder name
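
The steps above can be sketched as a small generator. This is a simplified illustration, not the code from `data_tools.py`: the name `stream_batches`, the `seed` parameter and the toy data are assumptions made for this example.

```python
import random
from typing import Iterator, List, Sequence, Tuple

def stream_batches(
    dataset: Sequence[Tuple], batchsize: int, seed: int = 0
) -> Iterator[List[Tuple]]:
    """Yield random batches forever, reshuffling when the data runs out."""
    if batchsize > len(dataset):
        raise ValueError("batchsize is larger than the dataset")
    rng = random.Random(seed)
    index_list = list(range(len(dataset)))
    rng.shuffle(index_list)
    index = 0
    while True:
        # not enough items left for a full batch: reshuffle and start over
        if index + batchsize > len(index_list):
            rng.shuffle(index_list)
            index = 0
        # the index_list[index] pattern: look up the shuffled position
        batch = [dataset[index_list[index + i]] for i in range(batchsize)]
        index += batchsize
        yield batch

# toy (x, y) pairs standing in for (image, label) tuples
toy = [(f"img{i}", i % 3) for i in range(10)]
gen = stream_batches(toy, batchsize=4)
first = next(gen)
print(first)
```

Shuffling the indices instead of the data itself keeps the dataset untouched and makes reshuffling cheap.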

I implemented everything in the `src/data/data_tools.py` file, in a `Dataloader` class. Check out the file and study how I did that.

We can time this, and it is fast enough for a batch size of 32: I clocked 2.68ms for a single image, which gives about 86ms for just loading 32 images from disk. Depending on things like my CPU temperature, I get around 98ms for a batch. The additional 12ms for resizing, decoding and casting to numpy for 32 images comes down to about 0.4ms per image.

In [None]:
import random

class BaseDataset:
    """The main responsibility of the Dataset class is to load the data from disk
    and to offer a __len__ method and a __getitem__ method
    """

    def __init__(self, paths: List[Path]) -> None:
        self.paths = paths
        random.shuffle(self.paths)
        self.dataset: List = []
        self.process_data()

    def __len__(self) -> int:
        return len(self.dataset)

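    def __getitem__(self, idx: int) -> Tuple:
        return self.dataset[idx]

    def process_data(self) -> None:
        # subclasses fill self.dataset with (x, y) tuples
        raise NotImplementedError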

class ImgDataset(BaseDataset):
    def __init__(self, paths, class_names, img_size):
        self.img_size = img_size
        self.class_names = class_names
        super().__init__(paths)

    def process_data(self) -> None:
        for file in self.paths:
            img = load_image(file, self.img_size)
            x = np.reshape(img, (1,) + img.shape)
            y = self.class_names.index(file.parent.name)
            self.dataset.append((x, y))


In [None]:
paths, class_names = iter_valid_paths(
    path = image_folder / "flower_photos",
    formats = [FileTypes.JPG],
)
dataset = ImgDataset([*paths], class_names, img_size=(150, 150))
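
Note that `ImgDataset` loads every image in `__init__`, so the full dataset sits in memory. For a dataset that does not fit in RAM, you could store only the paths and load each image on demand. A minimal sketch of that idea follows; the class name `LazyImgDataset` and the `loader` argument are hypothetical, and this is not necessarily how `mads_datasets` does it. The loader is a stand-in here so the example runs without images on disk.

```python
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

class LazyImgDataset:
    """Keeps only the paths in memory; loads an item when it is requested."""

    def __init__(
        self, paths: Sequence[Path], class_names: List[str], loader: Callable
    ) -> None:
        self.paths = list(paths)
        self.class_names = class_names
        self.loader = loader

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> Tuple:
        path = self.paths[idx]
        x = self.loader(path)  # the disk read happens here, not in __init__
        y = self.class_names.index(path.parent.name)
        return x, y

# made-up paths and a fake loader, only to show the mechanics
fake_paths = [
    Path("flower_photos/tulips/a.jpg"),
    Path("flower_photos/roses/b.jpg"),
]
ds = LazyImgDataset(
    fake_paths, ["roses", "tulips"], loader=lambda p: f"loaded:{p.name}"
)
print(len(ds), ds[0], ds[1])
```

In practice you would pass something like `lambda p: load_image(p, (150, 150))` as the loader. The trade-off is that every access now pays the cost of a disk read.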

All these methods are wrapped together inside the datasetfactory:

In [None]:
datasets = flowersfactory.create_dataset()
train = datasets["train"]

In [None]:
len(train)

In [None]:
x, y = train[1]
x.shape, y

Each item in the dataset is now an (img, label) tuple. However, for a batch we want to untangle a certain number of these pairs into a list of images and a list of labels.
Think of this as unzipping a zipper. Weirdly enough, in Python we use the same function for this (`zip`) as we would use to create the pairs.
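
A small example with made-up pairs shows the trick: unpacking the list with `*` hands every pair to `zip` as a separate argument, and `zip` then groups all first elements together and all second elements together.

```python
pairs = [("img_a", 0), ("img_b", 2), ("img_c", 1)]

# unzip: one tuple of images, one tuple of labels
images, labels = zip(*pairs)
print(images)  # ('img_a', 'img_b', 'img_c')
print(labels)  # (0, 2, 1)

# zipping again gives the original pairs back
rezipped = list(zip(images, labels))
print(rezipped == pairs)  # True
```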

In [None]:
def batch_processor(batch):
    X, Y = zip(*batch)
    return np.concatenate(X), np.array(Y)

In [None]:
from mads_datasets.base import BaseDatastreamer
streamer = BaseDatastreamer(
    dataset=train,
    batchsize=32,
    preprocessor=batch_processor
)

In [None]:
gen = streamer.stream()
X, y = next(gen)
X.shape, y.shape

Et voilà: the dataset loaded the (img, label) pairs from disk, and our streamer has selected 32 of those pairs and recombined them into a batch of 32 images, sized 150x150, with 3 channels (for colour). The labels are just an array of 32 labels.
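
The reason `np.concatenate` produces the right shape is that every image already carries a leading batch axis of size 1, and concatenation joins arrays along that first axis. A sketch with zeros standing in for images (the labels are made up):

```python
import numpy as np

# four fake (img, label) pairs, each image shaped (1, 150, 150, 3)
batch = [
    (np.zeros((1, 150, 150, 3), dtype=np.uint8), label)
    for label in [0, 3, 1, 4]
]

X, Y = zip(*batch)
X_batch = np.concatenate(X)  # joins along the existing first axis
y_batch = np.array(Y)
print(X_batch.shape, y_batch.shape)  # (4, 150, 150, 3) (4,)

# np.stack would add a new axis instead, leaving the size-1 axis in place
print(np.stack(X).shape)  # (4, 1, 150, 150, 3)
```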

In [None]:
%timeit X, y = next(gen)

In [None]:
streamers = flowersfactory.create_datastreamer(batchsize=32)

In [None]:
streamers