# Instead of reading every image file in the train_images folder, I decided to read all images from the .tfrec datasets.

This cut the runtime considerably!

Here's what changed in this update:
* added functions to convert byte-format image data to NumPy uint8 arrays
* optimised the conversion functions to reduce extra latency


Here are the runtime measurements:

1. read all the .tfrec files, map the parser over them, and convert each image into a NumPy array
    * CPU times: user 1min 30s, sys: 2.37 s, total: 1min 32s
    * Wall time: 1min 31s


2. read each image file in the train_images folder using cv2.imread
    * CPU times: user 2min 47s, sys: 8.81 s, total: 2min 56s
    * Wall time: 5min 25s
    
Comparing the two wall times (1min 31s vs. 5min 25s), that is still a reduction of about 72%, or roughly 3.6x faster!!
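
As a sanity check on the comparison, the reduction can be recomputed from the two wall times listed above. This is only a small arithmetic sketch; the helper name `to_seconds` is made up for illustration and is not part of the notebook.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xmin Ys' timing into plain seconds."""
    return minutes * 60 + seconds

# Wall times quoted in the two measurements above.
tfrec_wall = to_seconds(1, 31)    # reading the .tfrec files
imread_wall = to_seconds(5, 25)   # reading each file with cv2.imread

# Fraction of wall time saved, and the equivalent speed-up factor.
reduction = (imread_wall - tfrec_wall) / imread_wall
speedup = imread_wall / tfrec_wall

print(f"wall-time reduction: {reduction:.1%}")
print(f"speed-up: {speedup:.2f}x")
```

Note that these are single-run timings, so disk caching and the order in which the cells were executed may shift the exact figure.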

What can you do with a .tfrec file?
* You can read different data types associated with the same object at once (image data as a NumPy array, image name as a string, and disease class as an int, using only TFRecordDataset()).
* With the conventional method, you need separate functions to read each data type (cv2.imread() for the JPEG image, pandas.read_csv() for the image name string and disease class int, etc.).
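
To make the "several data types in one record" point concrete, here is a minimal, self-contained sketch that writes one toy record and reads it back. The random image, the file name `toy.png`, the class id `3`, and the temporary file are all made up for illustration; only the feature keys (`image`, `image_name`, `target`) mirror the ones used for the Cassava files. PNG is used here instead of JPEG so the round trip is lossless.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical toy data: a tiny random image, a file name, and a class id.
image = np.random.randint(0, 256, size=(4, 6, 3), dtype=np.uint8)
png_bytes = tf.io.encode_png(image).numpy()  # encoded image as raw bytes

# One Example holds all three pieces of information side by side.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes])),
    "image_name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"toy.png"])),
    "target": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
}))

path = os.path.join(tempfile.mkdtemp(), "toy.tfrec")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

feature_description = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "image_name": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.int64),
}

def parse(example_proto):
    # Turn one serialized Example into a dict of tensors.
    return tf.io.parse_single_example(example_proto, feature_description)

# A single pass over the dataset yields image, name, and label together.
records = []
for features in tf.data.TFRecordDataset(path).map(parse):
    pixels = tf.io.decode_image(features["image"], channels=3).numpy()
    name = features["image_name"].numpy().decode("utf-8")
    label = int(features["target"].numpy())
    records.append((pixels, name, label))
```

Each element of `records` then carries the uint8 pixel array, the name string, and the integer label, without a separate CSV lookup.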


In [None]:

# read .tfrec data

import cv2
import numpy as np # linear algebra
import tensorflow as tf
import os
import matplotlib.pyplot as plt
import pandas as pd

# Create a dictionary describing the features.
train_feature_description = {
    'image': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'image_name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'target': tf.io.FixedLenFeature([], tf.int64, default_value=0),
}

def _parse_image_function(example_proto):
    return tf.io.parse_single_example(example_proto, train_feature_description)

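def preprocess_image(image):
    # Decode the encoded image bytes into a uint8 tensor with 3 colour channels.
    image = tf.io.decode_image(image, channels=3)
    # Optional: resize to the Cassava image size (note: resize returns float32)
#     image = tf.image.resize(image, [600, 800])
    return image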


In [None]:
%%time

BASE_DIR = "../input/cassava-leaf-disease-classification/"
tfrec_dir = "train_tfrecords/"
tfimage_set = []

for tfName in os.listdir(os.path.join(BASE_DIR, tfrec_dir)):
    train_image_dataset = tf.data.TFRecordDataset(os.path.join(BASE_DIR, tfrec_dir, tfName))
    
    # when you want to check the dataset structure
#     for raw_record in train_image_dataset.take(1):
#         example = tf.train.Example()
#         example.ParseFromString(raw_record.numpy())

    train_images = train_image_dataset.map(_parse_image_function)    
    
    counter=0
    for image_features in train_images:
        counter += 1
        image_raw = preprocess_image(image_features['image'])
        image_raw_int = image_raw.numpy()

        # this is when you want to display the first image from each .tfrec file
#         if(counter==1):
#             print(image_raw_int.dtype,image_raw_int.shape,image_raw_int.min(),image_raw_int.max())
#             plt.imshow(image_raw_int)
#             plt.show()
            
            

In [None]:
%%time

# conventional:
# read each image file in the train_images folder
# read the associated info for each image file from train.csv

for image_name in os.listdir(os.path.join(BASE_DIR, "train_images")):
    # note: cv2.imread returns the image in BGR channel order
    image = cv2.imread(os.path.join(BASE_DIR, "train_images", image_name))

df_train = pd.read_csv(os.path.join(BASE_DIR, "train.csv"))
    