# Introduction

**Main Topic**
This notebook is for anyone who wants to build a TFRecord dataset from the [Cassava Leaf Disease competition data](https://www.kaggle.com/c/cassava-leaf-disease-classification).

I'll show how to write the training and validation TFRecord files and how to read them back with `tf.data`.


**References**
- **Tensorflow Official Docs** [TFRecord and tf.train.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord)
- **Yaroslav Isaienkov:** [Cassava Leaf Disease - Exploratory Data Analysis](https://www.kaggle.com/ihelon/cassava-leaf-disease-exploratory-data-analysis)
- **Jesse Mostipak, Phil Culliton:** [Getting Started: TPUs + Cassava Leaf Disease](https://www.kaggle.com/jessemostipak/getting-started-tpus-cassava-leaf-disease)
- **H. Noh, A. Araujo, J. Sim, T. Weyand and B. Han:** [Deep Local and Global Image Features Implement](https://github.com/tensorflow/models/tree/master/research/delf)

# Set up environment

In [None]:
import os
import tensorflow as tf
import numpy as np
import pandas as pd
import functools
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from glob import glob

In [None]:
cfg = {
    'data_path': '../input/cassava-leaf-disease-classification',
    'train_prefix': 'train',
    'valid_prefix': 'valid',
    'validation_ratio': 0.2,
    'num_shards' : 4
}

train_df = pd.read_csv(os.path.join(cfg['data_path'], 'train.csv'))

In [None]:
train_df.head()

# Split train-validation list

In [None]:
TRAIN_LISTS, VALID_LISTS = train_test_split(train_df, test_size=cfg['validation_ratio'], random_state=5)
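# drop=True discards the shuffled original index instead of keeping it as an extra column.
TRAIN_LISTS = TRAIN_LISTS.reset_index(drop=True)
VALID_LISTS = VALID_LISTS.reset_index(drop=True)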
print(f'train: {len(TRAIN_LISTS)}, validation: {len(VALID_LISTS)}')

# Set up `tf.train.Example`

### Data types for `tf.train.Example`

Fundamentally, a `tf.train.Example` is a `{"string": tf.train.Feature}` mapping.

The `tf.train.Feature` message type can accept one of the following three types (See the [`.proto` file](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for reference). Most other generic types can be coerced into one of these:

1. `tf.train.BytesList` (the following types can be coerced)

  - `string`
  - `byte`

1. `tf.train.FloatList` (the following types can be coerced)

  - `float` (`float32`)
  - `double` (`float64`)

1. `tf.train.Int64List` (the following types can be coerced)

  - `bool`
  - `enum`
  - `int32`
  - `uint32`
  - `int64`
  - `uint64`
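
As a minimal sketch of the mapping described above, the cell below builds a `tf.train.Example` by hand with one feature of each list type, serializes it, and parses it back. The feature names (`name`, `score`, `label`) and their values are made up for illustration and are not part of the competition data.

```python
import tensorflow as tf

# One feature per list type; keys and values are illustrative only.
features = tf.train.Features(feature={
    'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'leaf.jpg'])),
    'score': tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
})
example = tf.train.Example(features=features)

# An Example is a protocol buffer, so it round-trips through bytes.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)

print(restored.features.feature['label'].int64_list.value[0])
```

Each feature always holds a list, even for a single value, which is why the value is read back with `[0]`.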

In order to convert a standard TensorFlow type to a `tf.train.Example`-compatible `tf.train.Feature`, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above:

In [None]:
# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Note: To keep things simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use `tf.io.serialize_tensor` to convert tensors to binary strings (strings are scalars in TensorFlow), then use `tf.io.parse_tensor` to convert the binary string back to a tensor.
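
The round trip mentioned in the note can be sketched as follows. The 2x2 `matrix` is a made-up stand-in for any non-scalar feature; it is not used elsewhere in this notebook.

```python
import numpy as np
import tensorflow as tf

# A small non-scalar tensor (illustrative values only).
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)

# serialize_tensor returns a scalar tf.string tensor, which fits in a BytesList.
serialized = tf.io.serialize_tensor(matrix)
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()]))

# parse_tensor needs the original dtype to rebuild the tensor.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)

print(np.array_equal(restored.numpy(), matrix.numpy()))
```

Note that `out_type` must match the dtype used when serializing, so it is worth recording the dtype alongside the feature name.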

# Define `image_example`
`image_example` packs the raw JPEG bytes of one image, its label, and its decoded dimensions into a `tf.train.Example`.

In [None]:
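def image_example(image_string, label):
  """Builds a tf.train.Example from raw JPEG bytes and an integer label."""
  # Decode only to read the image dimensions; the stored bytes stay JPEG-encoded.
  image_shape = tf.image.decode_jpeg(image_string).shape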

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

# Define `_write_tfrecord`
Use `num_shards` to control how many [shards](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset#shard) (output files) each split is written to.

In [None]:
def _write_tfrecord(output_prefix, file_list, num_shards=cfg['num_shards']):
    """Writes the rows of `file_list` into `num_shards` TFRecord files."""
    # Shard boundaries as integer row indices (np.int no longer exists in recent NumPy).
    spacing = np.linspace(0, len(file_list), num_shards + 1, dtype=int)

    for shard in range(num_shards):
        # Zero-padded file name, e.g. train-001-004.tfrec.
        output_file = f'{output_prefix}-{shard + 1:03d}-{num_shards:03d}.tfrec'
        print('Processing shard ', shard + 1, ' and writing file ', output_file)

        with tf.io.TFRecordWriter(output_file) as writer:
            for i in range(spacing[shard], spacing[shard + 1]):
                image_path = os.path.join(
                    cfg['data_path'], 'train_images', file_list['image_id'][i])
                # Read the raw JPEG bytes and close the file right away.
                with open(image_path, 'rb') as f:
                    image_string = f.read()
                tf_example = image_example(image_string, file_list['label'][i])
                writer.write(tf_example.SerializeToString())

In [None]:
_write_tfrecord(cfg['train_prefix'], TRAIN_LISTS)
_write_tfrecord(cfg['valid_prefix'], VALID_LISTS)
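
To see how `np.linspace` divides the rows among shards, here is a small sketch with toy numbers (10 rows, 4 shards). `shard_bounds` is a hypothetical helper written for this illustration; it uses the built-in `int` as the dtype.

```python
import numpy as np

def shard_bounds(num_items, num_shards):
    """Hypothetical helper: half-open (start, end) row ranges, one per shard."""
    spacing = np.linspace(0, num_items, num_shards + 1, dtype=int)
    return [(int(spacing[i]), int(spacing[i + 1])) for i in range(num_shards)]

bounds = shard_bounds(10, 4)
print(bounds)  # [(0, 2), (2, 5), (5, 7), (7, 10)]
```

The shards differ in size by at most one row, and together they cover every row exactly once.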

# Load TFRecords

You can customize the modules below to load the TFRecords.

In [None]:
class _DataAugmentationParams(object):
  """Default parameters for augmentation."""
  # The following are used for training.
  min_object_covered = 0.1
  aspect_ratio_range_min = 3. / 4
  aspect_ratio_range_max = 4. / 3
  area_range_min = 0.08
  area_range_max = 1.0
  max_attempts = 100
  update_labels = False
  # 'central_fraction' is used for central crop in inference.
  central_fraction = 0.875

  random_reflection = False
  input_rows = 224
  input_cols = 224

In [None]:
def NormalizeImages(images, pixel_value_scale=0.5, pixel_value_offset=0.5):
  """Normalizes pixel values as (images - pixel_value_offset) / pixel_value_scale."""
  images = tf.cast(images, tf.float32)
  normalized_images = tf.math.divide(
      tf.subtract(images, pixel_value_offset), pixel_value_scale)
  return normalized_images


def _ImageNetCrop(image):
  """ImageNet-style crop with a random bounding box and aspect ratio."""
  params = _DataAugmentationParams()
  bbox = tf.constant([0.0, 0.0, 1.0, 1.0], dtype=tf.float32, shape=[1, 1, 4])
  (bbox_begin, bbox_size, _) = tf.image.sample_distorted_bounding_box(
      tf.shape(image),
      bounding_boxes=bbox,
      min_object_covered=params.min_object_covered,
      aspect_ratio_range=(params.aspect_ratio_range_min,
                          params.aspect_ratio_range_max),
      area_range=(params.area_range_min, params.area_range_max),
      max_attempts=params.max_attempts,
      use_image_if_no_bounding_boxes=True)
  cropped_image = tf.slice(image, bbox_begin, bbox_size)
  cropped_image.set_shape([None, None, 3])

  cropped_image = tf.image.resize(
      cropped_image, [params.input_rows, params.input_cols], method='area')
  if params.random_reflection:
    cropped_image = tf.image.random_flip_left_right(cropped_image)

  return cropped_image
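
The arithmetic in `NormalizeImages` is `(pixels - offset) / scale`. The NumPy sketch below mirrors it with the scale and offset both set to 128.0, the values that `_ParseFunction` passes. The `normalize` and `denormalize` helpers are hypothetical names used only for this illustration.

```python
import numpy as np

def normalize(pixels, scale=128.0, offset=128.0):
    """Same formula as NormalizeImages, written with NumPy."""
    return (pixels.astype(np.float32) - offset) / scale

def denormalize(values, scale=128.0, offset=128.0):
    """Inverse mapping, handy when an image has to be displayed again."""
    return values * scale + offset

pixels = np.array([0, 128, 255], dtype=np.uint8)
normalized = normalize(pixels)
print(normalized)               # [-1.         0.         0.9921875]
print(denormalize(normalized))  # [  0. 128. 255.]
```

With these settings the pixel range 0 to 255 maps to roughly -1 to 1, so the values must be mapped back to a displayable range before plotting.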

In [None]:
def _ParseFunction(example, name_to_features, image_size, augmentation):
  """Parse a single TFExample to get the image and label and process the image.
  Args:
    example: a `TFExample`.
    name_to_features: a `dict`. The mapping from feature names to its type.
    image_size: an `int`. The image size for the decoded image, on each side.
    augmentation: a `boolean`. True if the image will be augmented.
  Returns:
    image: a `Tensor`. The processed image.
    label: a `Tensor`. The ground-truth label.
  """
  parsed_example = tf.io.parse_single_example(example, name_to_features)
  # Parse to get image.
  image = parsed_example['image']
  # Force 3 channels so the shape always matches the [.., .., 3] set later.
  image = tf.io.decode_jpeg(image, channels=3)
  image = NormalizeImages(
      image, pixel_value_scale=128.0, pixel_value_offset=128.0)
  if augmentation:
    image = _ImageNetCrop(image)
  else:
    image = tf.image.resize(image, [image_size, image_size])
    image.set_shape([image_size, image_size, 3])
  # Parse to get label.
  label = parsed_example['label']

  return image, label

In [None]:
def CreateDataset(file_pattern,
                  image_size=224,
                  batch_size=32,
                  augmentation=False,
                  seed=0):
  """Creates a dataset.
  Args:
    file_pattern: str, file pattern of the dataset files.
    image_size: int, image size.
    batch_size: int, batch size.
    augmentation: bool, whether to apply augmentation.
    seed: int, seed for shuffling the dataset.
  Returns:
     A batched, repeating `tf.data.Dataset` of (image, label) pairs.
  """

  filenames = tf.io.gfile.glob(file_pattern)

  dataset = tf.data.TFRecordDataset(filenames)
  dataset = dataset.repeat().shuffle(buffer_size=100, seed=seed)

  # Create a description of the features.
  feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image': tf.io.FixedLenFeature([], tf.string),
  }

  customized_parse_func = functools.partial(
      _ParseFunction,
      name_to_features=feature_description,
      image_size=image_size,
      augmentation=augmentation)
  dataset = dataset.map(customized_parse_func)
  dataset = dataset.batch(batch_size)

  return dataset
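
Parsing only works when the feature description matches what was written. A quick way to check this is a round trip on a toy file, sketched below. The records, the file name `toy-001-001.tfrec`, and the short byte payloads (standing in for JPEG bytes) are all made up for the example.

```python
import os
import tempfile
import tensorflow as tf

# Toy records: (label, payload). The payloads stand in for JPEG bytes.
records = [(0, b'aaa'), (1, b'bbb'), (2, b'ccc')]

path = os.path.join(tempfile.mkdtemp(), 'toy-001-001.tfrec')
with tf.io.TFRecordWriter(path) as writer:
    for label, payload in records:
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[payload])),
        }))
        writer.write(example.SerializeToString())

# The description must use the same keys and dtypes as the writer.
feature_description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image': tf.io.FixedLenFeature([], tf.string),
}

dataset = tf.data.TFRecordDataset([path])
dataset = dataset.map(
    lambda ex: tf.io.parse_single_example(ex, feature_description))

parsed_records = [(int(p['label'].numpy()), p['image'].numpy()) for p in dataset]
print(parsed_records)
```

Unlike this toy pipeline, `CreateDataset` calls `repeat()`, so its dataset never ends; iterate over it with `take(n)` rather than a plain loop.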

In [None]:
train_dataset = CreateDataset('train-*')

In [None]:
print(train_dataset)

In [None]:
for image, label in train_dataset.take(1):
    # Pixels were normalized to about [-1, 1]; map them back to [0, 1] for imshow.
    plt.imshow((image[0] + 1.0) / 2.0)