# Fast Image Reading via Multiprocessing & Getting Image Sizes

The train dataset contains ~1.5 million images. Popular image libraries such as imageio or cv2 read about 20 images per second on average (measured over the first 38k images in the train dataset). At that rate, reading all 1.5 million images would take about 20.8 hours, far too long to keep a Kaggle notebook alive.

This reading time can be reduced with multiprocessing, as shown in this notebook. Taking advantage of the faster reads, I present a full list of image size (xsize, ysize) and depth (channel) information for all train images.
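The arithmetic behind the 20.8 hour figure can be reproduced with a rough sketch. The helper name `reading_time_hours` is made up for illustration, and the 4-worker number assumes throughput scales linearly with the number of processes, which is optimistic and not a rate measured in this notebook.

```python
def reading_time_hours(n_images, images_per_second, workers=1):
    # Assumption: throughput scales linearly with the worker count (optimistic).
    return n_images / (images_per_second * workers) / 3600


serial = reading_time_hours(1_500_000, 20)       # single process, ~20 images/s
parallel = reading_time_hours(1_500_000, 20, 4)  # 4 processes, ideal scaling
print(f"{serial:.1f} h serial, {parallel:.1f} h with 4 workers")
```

In practice the gain depends on disk throughput and the number of available cores, so the parallel figure should be read as a lower bound.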

P.S. Please don't forget to upvote, if you find the notebook useful.

In [None]:
import os
import time
import tqdm
import imageio
import numpy as np
import pandas as pd
import tensorflow as tf
from multiprocessing import Pool
import seaborn as sns; sns.set(style="white", color_codes=True)
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
base_dir = '../input/landmark-recognition-2020/'
train_csv = pd.read_csv(base_dir + 'train.csv')
sample_submission = pd.read_csv(base_dir + 'sample_submission.csv')

In [None]:
train_csv.head()

In [None]:
train_csv.info()

In [None]:
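def get_image_features(lid):
    # Images are stored under train/<c0>/<c1>/<c2>/<id>.jpg,
    # where c0..c2 are the first three characters of the id.
    impath = base_dir + 'train/' + '/'.join(list(lid[:3])) + '/' + lid + '.jpg'
    im = imageio.imread(impath)
    # im.shape is (rows, cols[, channels]); xsize is the row count, ysize the column count.
    xsize, ysize = im.shape[:2]
    # Grayscale images decode to 2-D arrays, so report a depth of 1 for them.
    depth = im.shape[2] if im.ndim == 3 else 1
    return xsize, ysize, depth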
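The directory layout that `get_image_features` relies on can be illustrated in isolation. This is a minimal sketch: `image_path` is a hypothetical helper and `"abcdef"` is a made-up id, not one taken from the dataset.

```python
def image_path(image_id, root="train"):
    # Hypothetical helper: the file is nested under the first three
    # characters of its id, one directory level per character.
    return "/".join([root, *image_id[:3], image_id + ".jpg"])


print(image_path("abcdef"))  # train/a/b/c/abcdef.jpg
```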

In [None]:
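# Read image sizes in 4 worker processes; imap returns results in the order of train_csv.id.
with Pool(4) as p:
    r = list(tqdm.tqdm(p.imap(get_image_features, train_csv.id), total=len(train_csv)))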

In [None]:
sizes = np.array(r)  # shape: (n_images, 3)
train_csv['xsize'] = sizes[:, 0]
train_csv['ysize'] = sizes[:, 1]
train_csv['depth'] = sizes[:, 2]

In [None]:
train_csv.to_csv('train_featured.csv', index=False)

**Train Image Size Distribution**

In [None]:
g = sns.jointplot(x="xsize", y="ysize", data=train_csv)