# How to load images quickly ?
*This is my first large scale computer vision competition so feel free to prove me wrong in the comment section* 😉

So I've created my first code for this competition and came across a little problem:<br/>
**IT'S SOOOOO SLOOOWWWWW**<br/>

So I wanted to share with you my solution to iterate quickly on this competition. 



In [None]:
import numpy as np
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
import time

# 1. Why slow ?
## 1.1 The size of images
So first of all, I didn't even suspect that loading images would be a problem. As a matter of fact, when you start out with computer vision tutorials, you usually come across some MNIST tutorial that loads the entire dataset into RAM without worrying about it at all.

In [None]:
N_IMAGES = len(os.listdir('../input/happy-whale-and-dolphin/test_images')) + len(os.listdir('../input/happy-whale-and-dolphin/train_images'))
print("Total number of images in the dataset:", N_IMAGES)

But if you do that here, you'll be in big trouble because your dataset is **BIG**, like ~80k images big...<br>
So maybe now you want to tell me:<br>
*-But MNIST is about the same number of images*<br>
And you would be right but the *little* difference is that here it's far from being 28\*28 grayscale images...<br>
Let's open a random image to find out:

In [None]:
def read_and_plot_image(image_id):
    t = time.perf_counter()
    image = cv2.imread(f"../input/happy-whale-and-dolphin/train_images/{image_id}.jpg")
    print("Opened the image in {:.2f}ms".format((time.perf_counter()-t)*1000))
    # OpenCV loads images as BGR, but matplotlib expects RGB
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.show()
    print("Image array shape:", image.shape)
    print("Image data type:", image.dtype)
read_and_plot_image("00144776eb476d")

Yeah, that's a *lot* of pixels, and that's why the Kaggle dataset is 62.06 GB...
And in fact, if the images were stored as raw pixels, it would be even bigger than that. Which leads me to...

## 1.2 The image compression
If you have a look at the images, you will notice that they are encoded in the *.jpg* format. This format encodes an image very efficiently. It loses some information in the process, but it does so in a way that barely changes how the image looks.<br>
For example, take the following image:

In [None]:
read_and_plot_image("00144776eb476d")

The image size on disk is 1.22 MB. But by performing a very clever multiplication, you will notice that...

In [None]:
print("2336 * 3504 * 3 =", 2336 * 3504 * 3)

... it should take 24.56 MB as raw pixels. That's the magic of JPEG happening right here.<br>
However, it comes at a cost. The file takes less space on disk, but it is slower to load because the computer has to rebuild the original pixels using some fancy math (entropy decoding, then an inverse discrete cosine transform on every 8x8 block).
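To put a number on that trade-off, here is a small sketch comparing the raw size with the size on disk. The helper name `raw_size_bytes` is mine, the dimensions are the ones from the multiplication above, and the 1.22 MB figure is the on-disk size quoted above.

```python
def raw_size_bytes(height, width, channels, bytes_per_channel=1):
    """Size of an uncompressed image array (uint8 by default)."""
    return height * width * channels * bytes_per_channel

raw = raw_size_bytes(2336, 3504, 3)  # dimensions of the example image
on_disk = 1.22e6                     # JPEG size quoted above, in bytes

print(f"Raw size: {raw / 1e6:.2f} MB")
print(f"Compression ratio: about {raw / on_disk:.0f}x")
```

So for this image, the decoder has to expand roughly 20 times more data than it reads from disk.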

# 2. So how do you make it **fast** ?
## 2.1 Fewer pixels
Well the answer is: do you really need 2336x3504 images to identify a dolphin fin ?
Well *maybe*...<br>
but let's pretend that we can do it with far less pixels !<br>
The goal is to reduce the image size. There are multiple ways to do it, like cropping only the interesting areas of the image and / or resizing the image.<br>
As I don't know how to precisely identify the interesting parts of the image (and it's possible that the very clever solutions that will win this competition will find a way to do it at some point), I will take the easy path and just resize every image to a 224x224 square. It's a bad idea and you probably shouldn't use it in your final solution.

**Why is it a bad idea ?**<br>
Well it's a bad idea because the images aren't all square, so this solution will squeeze them and lose some information. Furthermore, there may be details that will not be captured at this resolution.
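If the squeeze bothers you, one possible alternative (not what this notebook does) is to keep the aspect ratio and pad the shorter side, often called letterboxing. Here is a minimal sketch, assuming black padding is acceptable; the names `letterbox_geometry` and `letterbox` are hypothetical helpers, not part of OpenCV.

```python
def letterbox_geometry(height, width, size):
    """Return (new_height, new_width, top, left) for fitting an image
    inside a size x size square without changing its aspect ratio."""
    scale = size / max(height, width)
    new_height = max(1, round(height * scale))
    new_width = max(1, round(width * scale))
    top = (size - new_height) // 2
    left = (size - new_width) // 2
    return new_height, new_width, top, left

def letterbox(image, size):
    """Resize an image to fit in a square and pad the rest with black."""
    import cv2  # imported here so the geometry helper works without OpenCV
    new_height, new_width, top, left = letterbox_geometry(*image.shape[:2], size)
    # cv2.resize takes the target size as (width, height)
    resized = cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    bottom = size - new_height - top
    right = size - new_width - left
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))

print(letterbox_geometry(2336, 3504, 224))
```

For a 2336x3504 image this gives a 149x224 picture with 37 rows of padding on top. You keep the proportions, but part of the 224x224 square is wasted on padding.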

**Why do I do it anyway ?**<br>
I still do it because it will be a solid base for quickly testing new ideas. Then I will bother creating better-crafted datasets for the final solutions.

**Why 224x224 ?**<br>
It's completely arbitrary, feel free to fork this notebook and change the resolution (next code cell) if you're unhappy with it. (Running time is less than 2 hours)

In [None]:
IMAGE_SIZE = 224

In [None]:
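image = "00144776eb476d"
image = cv2.imread(f"../input/happy-whale-and-dolphin/train_images/{image}.jpg")
image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_CUBIC)
print(image.shape)
# Convert BGR to RGB for display only; `image` itself stays BGR for cv2.imwrite below
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()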

As expected, the image is a little squeezed, but we can still see a fair amount of detail so I'll go for it.

## 2.2 Image format
As we saw, JPEG may not be ideal for loading images quickly. So here we'll check whether some other formats are faster to load:

In [None]:
cv2.imwrite("image.ppm", image)
cv2.imwrite("image.bmp", image)
cv2.imwrite("image.dib", image)
cv2.imwrite("image.jpg", image)
cv2.imwrite("image.jp2", image)
cv2.imwrite("image.png", image)
cv2.imwrite("image.tiff", image)
cv2.imwrite("image.tif", image)
np.save("image.npy", image)

In [None]:
%timeit -n 100 -r 10 cv2.imread("./image.ppm")
%timeit -n 100 -r 10 cv2.imread("./image.bmp")
%timeit -n 100 -r 10 cv2.imread("./image.dib")
%timeit -n 100 -r 10 cv2.imread("./image.jpg")
%timeit -n 100 -r 10 cv2.imread("./image.jp2")
%timeit -n 100 -r 10 cv2.imread("./image.png")
%timeit -n 100 -r 10 cv2.imread("./image.tiff")
%timeit -n 100 -r 10 cv2.imread("./image.tif")
%timeit -n 100 -r 10 np.load("./image.npy")

Well the least we can say is that for the same image, the format makes a huge difference...<br>
So we will go with ".bmp", but it seems that we could go with ".dib" without noticing the difference.
These formats are faster because the pixels are stored uncompressed in the file, so almost no computation is needed to get the original image back.

Before we convert everything, we have to make sure that the whole dataset will fit on disk:

In [None]:
!ls -l --block-size=K

In [None]:
# 148000 bytes is roughly the size of one .bmp file from the listing above
print("Expected total size of the dataset: {:.2f} GB".format(N_IMAGES * 148000 / 1e9))

So everything should be OK, because the expected output size of 11.69 GB is below the notebook output limit (19.6 GB).
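As a cross-check, the size of one file can also be worked out from the BMP layout. This is a sketch that assumes the common 24-bit layout with a 54-byte header and rows padded to a multiple of 4 bytes; I have not checked that OpenCV writes exactly this variant. The image count is the rough ~80k mentioned earlier, not the exact `N_IMAGES`.

```python
def bmp_size_bytes(height, width, bits_per_pixel=24, header_bytes=54):
    """Expected size of an uncompressed BMP file."""
    # Each row of pixels is padded to a multiple of 4 bytes
    row_bytes = (width * bits_per_pixel + 31) // 32 * 4
    return header_bytes + row_bytes * height

per_image = bmp_size_bytes(224, 224)
n_images = 80_000  # rough count quoted earlier; use N_IMAGES in the notebook

print(f"One image: {per_image} bytes")
print(f"Whole dataset: about {per_image * n_images / 1e9:.1f} GB")
```

Under these assumptions one 224x224 image takes 150,582 bytes, which is consistent with the roughly 148 KB used in the estimate, and the total stays well under the output limit.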

# 3. Okay let's do it !

In [None]:
!cp ../input/happy-whale-and-dolphin/sample_submission.csv sample_submission.csv
!cp ../input/happy-whale-and-dolphin/train.csv train.csv
!rm ./image.* 
!mkdir test_images
!mkdir train_images

In [None]:
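# Point the train file names at the converted .bmp files
train = pd.read_csv("./train.csv")
train["image"] = train["image"].str[:-3] + "bmp"
train.to_csv("./train.csv", index=False)
# For the submission file, keep the original "image" column
# and store the .bmp file name in a separate column
sample_submission = pd.read_csv("./sample_submission.csv")
sample_submission["inference_image"] = sample_submission["image"].str[:-3] + "bmp"
sample_submission.to_csv("./sample_submission.csv", index=False)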

In [None]:
def copy_dir(dirname, base_path="../input/happy-whale-and-dolphin/"):
    print("Copying", dirname)
    path = os.path.join(base_path, dirname)
    images = os.listdir(path)
    n = len(images)
    for i, f in enumerate(images):
        print(f"{i + 1}/{n}", end="\r")
        image_path = os.path.join(path, f)
        image = cv2.imread(image_path)
        if image is None:
            # cv2.imread returns None instead of raising when a file can't be decoded
            print("Could not read", image_path)
            continue
        image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_CUBIC)
        # Same file name, with the extension swapped for .bmp
        new_path = os.path.join("./", dirname, os.path.splitext(f)[0] + ".bmp")
        cv2.imwrite(new_path, image)

copy_dir("test_images")
copy_dir("train_images")

And done!<br>
Now, all you have to do to use this fast dataset is to add this notebook as input data and change the file paths so that they point to the output of this notebook instead of the original dataset.<br>
This can speed up your pipeline by a lot (theoretically up to 1000x if image reading is the only bottleneck, which is never the case in practice). Even if the rest of the code is not well optimized, it can still give a fair speedup, as image reading is usually a huge bottleneck.<br>
To give a practical idea, my (very bad and unoptimized) pipeline went 5x faster just by replacing the competition data with this one (because other bottlenecks remain).
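In practice, the path change can be as small as this. It's only a sketch: `FAST_ROOT` is a hypothetical name (it depends on what your copy of this notebook's output is called once you add it as input), and `image_path` is a helper I made up for the example.

```python
import os

ORIGINAL_ROOT = "../input/happy-whale-and-dolphin"
FAST_ROOT = "../input/fast-happy-whale"  # hypothetical: name of this notebook's output

def image_path(root, split, image_name, extension):
    """Build the path of an image in either dataset."""
    stem = os.path.splitext(image_name)[0]
    return os.path.join(root, f"{split}_images", f"{stem}.{extension}")

# Before: full-size JPEG from the competition data
print(image_path(ORIGINAL_ROOT, "train", "00144776eb476d.jpg", "jpg"))
# After: 224x224 BMP from this notebook's output
print(image_path(FAST_ROOT, "train", "00144776eb476d.jpg", "bmp"))
```

The rest of the pipeline stays the same, since `cv2.imread` reads both formats.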