<a href="https://colab.research.google.com/github/razerspeed/dataflow/blob/master/data_pipeline1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To download the dataset from Kaggle:

Generate an API token from My Account -> API. This downloads a `kaggle.json` file containing your username and key.
![alt text](https://drive.google.com/uc?id=1hbbD01-wtcCBNu2XI67rG5-62alucKVC)

**Upload the kaggle.json file**

In [1]:
from google.colab import files
files.upload() # this will prompt you to upload kaggle.json

!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
kaggle.json


**Or enter the username and key from your kaggle.json file manually**

In [0]:
# !echo '{"username":"<enter here>","key":"<enter key here>"}' > kaggle.json
# !pip install -q kaggle
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !ls ~/.kaggle
# !chmod 600 /root/.kaggle/kaggle.json
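
The same credentials file can also be written from Python instead of shell commands. The following is a minimal sketch using only the standard library; the helper name `write_kaggle_credentials` and its `config_dir` parameter are hypothetical, and the username and key are the same placeholders as in the cell above.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir):
    """Write kaggle.json into config_dir with owner-only access (hypothetical helper)."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    target.write_text(json.dumps({"username": username, "key": key}))
    # Equivalent of `chmod 600`: read/write for the owner only.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target


# Demonstration in a throwaway directory so nothing real is overwritten.
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials("<enter here>", "<enter key here>", demo_dir)
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
print(demo_path.name, oct(demo_mode))
```

In Colab you would presumably pass `Path.home() / ".kaggle"` as `config_dir`, which matches the `~/.kaggle` location used by the shell commands.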

***Downloading Dataset***

In [2]:
#downloading dataset
!kaggle datasets download -d prasunroy/natural-images 
!unzip -q natural-images.zip

Downloading natural-images.zip to /content
 98% 337M/342M [00:08<00:00, 47.2MB/s]
100% 342M/342M [00:08<00:00, 42.5MB/s]


**Libraries used and hyperparameter values**

In [0]:
import tensorflow as tf
import os
import numpy as np
import pandas as pd

In [0]:
BATCH_SIZE = 32
IMG_HEIGHT = 256
IMG_WIDTH = 156

In [0]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [0]:
dir_path = "natural_images"
CLASS_NAMES=np.array(os.listdir(dir_path))

In [7]:
CLASS_NAMES

array(['airplane', 'cat', 'person', 'fruit', 'dog', 'flower', 'motorbike',
       'car'], dtype='<U9')
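
`os.listdir` returns entries in arbitrary order, which is why the array above is not alphabetical, whereas Keras `flow_from_directory` assigns class indices in alphanumeric order of the subdirectory names. The sketch below assumes you want a reproducible label order shared by both pipelines; `class_to_index` is an illustrative name that does not appear in the notebook.

```python
# Class names copied from the notebook output above (os.listdir order).
listed = ['airplane', 'cat', 'person', 'fruit', 'dog', 'flower', 'motorbike', 'car']

# Sorting gives a reproducible order that does not depend on the filesystem.
sorted_names = sorted(listed)

# Hypothetical lookup table from class name to integer index.
class_to_index = {name: i for i, name in enumerate(sorted_names)}

print(sorted_names)
print(class_to_index['dog'])
```

In the notebook itself, the equivalent change would be `CLASS_NAMES = np.array(sorted(os.listdir(dir_path)))`.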

# **ImageDataGenerator implementation**


In [8]:
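# Create the ImageDataGenerator (no augmentation or rescaling is configured).
ImageFlow = tf.keras.preprocessing.image.ImageDataGenerator()
# flow_from_directory does not fit anything; it returns an iterator that yields
# batches of (images, one-hot labels) read from the class subdirectories.
# target_size is (height, width).
ImageGenerator = ImageFlow.flow_from_directory(dir_path,target_size=(156,256),seed=10,batch_size=32)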

Found 6899 images belonging to 8 classes.


Creating a dataset from a Python list with `Dataset.from_tensor_slices`; each list element becomes one dataset element.

In [10]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) 
for element in dataset: 
  print(element) 

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)



# ***Using tf.data Pipeline***




### **Step 1 Create a source dataset from input data**

In our case, `Dataset.list_files` creates a dataset of all file paths matching a glob pattern. Here the pattern `natural_images/*/*` matches every file inside each class directory.

In [0]:
list_ds = tf.data.Dataset.list_files(str(dir_path+'/*/*'),shuffle=False)

In [12]:
for f in list_ds.take(5):
  print(f.numpy())

b'natural_images/airplane/airplane_0000.jpg'
b'natural_images/airplane/airplane_0001.jpg'
b'natural_images/airplane/airplane_0002.jpg'
b'natural_images/airplane/airplane_0003.jpg'
b'natural_images/airplane/airplane_0004.jpg'
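
To see which paths a two-level pattern such as `*/*` selects, the sketch below applies the same pattern with the standard-library `glob` module to a tiny stand-in directory. It assumes `Dataset.list_files` follows ordinary glob semantics; the directory and file names are made up for the illustration.

```python
import glob
import os
import tempfile

# Build a tiny stand-in for natural_images/<class>/<file> in a temp directory.
root = tempfile.mkdtemp()
for cls, fname in [("airplane", "airplane_0000.jpg"), ("cat", "cat_0000.jpg")]:
    os.makedirs(os.path.join(root, cls), exist_ok=True)
    open(os.path.join(root, cls, fname), "w").close()
# A file directly under the root is one level too shallow for '*/*'.
open(os.path.join(root, "README.txt"), "w").close()

# Sort explicitly, because glob does not guarantee any particular order.
matches = sorted(glob.glob(os.path.join(root, "*", "*")))
relative = [os.path.relpath(m, root) for m in matches]
print(relative)
```

Only the files nested exactly one directory below the root are returned, so each matched path carries its class name as the parent directory.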


### **Step 2 Apply dataset transformations to preprocess the data**

In [0]:
def get_label(file_path):
  # convert the path to a list of path components
  parts = tf.strings.split(file_path, os.path.sep)
  # The second to last is the class-directory; comparing it against
  # CLASS_NAMES element-wise yields a boolean one-hot vector.
  return parts[-2] == CLASS_NAMES
def decode_img(img):
  # convert the compressed string to a 3D uint8 tensor
  img = tf.image.decode_jpeg(img, channels=3)
  # Use `convert_image_dtype` to convert to floats in the [0,1] range.
  img = tf.image.convert_image_dtype(img, tf.float32)
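  # resize the image to the desired size.
  # Note: tf.image.resize takes [height, width], so with the constants above
  # this produces images 156 high and 256 wide, the same as
  # target_size=(156,256) in the ImageDataGenerator cell.
  return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])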
def process_path(file_path):
  label = get_label(file_path)
  # load the raw data from the file as a string
  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label
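
The label produced by `get_label` is a one-hot boolean vector: the class directory name is compared against every entry of `CLASS_NAMES`. The sketch below reproduces that logic in plain Python so it can be inspected without TensorFlow; `get_label_py` is a hypothetical name, and the class order is copied from the `CLASS_NAMES` output earlier in the notebook.

```python
import os

# Class order copied from the CLASS_NAMES output earlier in the notebook.
CLASS_NAMES = ['airplane', 'cat', 'person', 'fruit', 'dog', 'flower', 'motorbike', 'car']


def get_label_py(file_path):
    """Plain-Python analogue of get_label (hypothetical name)."""
    parts = file_path.split(os.path.sep)
    # The second-to-last component is the class directory.
    class_dir = parts[-2]
    # One True at the matching class, False everywhere else.
    return [name == class_dir for name in CLASS_NAMES]


label = get_label_py(os.path.join('natural_images', 'cat', 'cat_0001.jpg'))
print(label)
print(label.index(True))
```

Because the position of the single `True` depends on the order of `CLASS_NAMES`, any change to that order changes which index represents each class.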


In [0]:
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)

### **Step 3 Iterate over the dataset**

In [0]:
def prepare_for_training(ds, cache=True, shuffle_buffer_size=1000):
  # This is a small dataset, only load it once, and keep it in memory.
  # use `.cache(filename)` to cache preprocessing work for datasets that don't
  # fit in memory.
  if cache:
    if isinstance(cache, str):
      ds = ds.cache(cache)
    else:
      ds = ds.cache()

  ds = ds.shuffle(buffer_size=shuffle_buffer_size)

  # Repeat forever
  ds = ds.repeat()

  ds = ds.batch(BATCH_SIZE)

  # `prefetch` lets the dataset fetch batches in the background while the model
  # is training.
  ds = ds.prefetch(buffer_size=AUTOTUNE)

  return ds
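
The order of `repeat` and `batch` in `prepare_for_training` matters: repeating first means every batch is full, and some batches span an epoch boundary. The sketch below illustrates this with plain Python iterators rather than `tf.data`; it is only an analogy, the `batched` helper is hypothetical, and shuffling is left out to keep the output readable.

```python
from itertools import cycle, islice


def batched(iterable, size):
    """Yield lists of up to `size` items (hypothetical helper)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


data = list(range(10))  # stand-in for a 10-element dataset

# repeat -> batch: every batch is full; some span the epoch boundary.
repeat_then_batch = list(islice(batched(cycle(data), 4), 5))

# batch -> repeat: each epoch ends with a short batch.
batch_then_repeat = list(islice(cycle(list(batched(data, 4))), 5))

print(repeat_then_batch)
print(batch_then_repeat)
```

With the repeat-then-batch order used in the notebook, taking 216 batches of 32 always yields exactly 6912 images.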


In [0]:
train_ds = prepare_for_training(labeled_ds)

# **Image Loading Test**

For **ImageDataGenerator**

In [18]:
# Time two full passes (epochs) over the ImageDataGenerator iterator.
import time

per_batch = 32
for t in range(2):
  start = time.time()
  batches = 0
  for x_batch, y_batch in ImageGenerator:
      batches += 1
      # The iterator loops forever, so stop once one epoch (6899 images) is covered.
      if batches >= 6899/per_batch:
          break
  end = time.time()
  duration = end-start
  print("{} batches: {} s".format(batches, duration))
  print("{:0.5f} Images/s".format(per_batch*batches/duration))


216 batches: 12.523882865905762 s
551.90551 Images/s
216 batches: 12.613014698028564 s
548.00539 Images/s
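
The figures above follow from simple arithmetic, worked through in the sketch below. It assumes that the generator's final batch of each epoch holds only the leftover images, in which case multiplying 216 batches by 32 slightly overstates the number of images actually loaded.

```python
import math

num_images = 6899               # "Found 6899 images" from flow_from_directory
batch_size = 32
duration = 12.523882865905762   # first timing printed above, in seconds

# Batches needed to cover the dataset once, and the size of the last one.
num_batches = math.ceil(num_images / batch_size)
last_batch = num_images - (num_batches - 1) * batch_size

reported = batch_size * num_batches / duration   # formula used in the cell
actual = num_images / duration                   # counts the short last batch

print(num_batches, last_batch)
print(round(reported, 2), round(actual, 2))
```

The difference is well under one percent, so it does not change the comparison between the two loaders.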


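For the **tf.data API**. Because `prepare_for_training` calls `cache()`, the first pass decodes the images and fills the in-memory cache, and the second pass reads from that cache, which is why it is much faster.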

In [17]:
# Time taken to load 216 batches of 32 images, measured twice.
import time
for i in range(2):
  start = time.time()
  for x, y in train_ds.take(216):
      pass
  end = time.time()
  duration = end-start
  print("{} batches: {} s".format(216, duration))
  print("{:0.5f} Images/s".format(32*216/duration))

216 batches: 11.293342590332031 s
612.04200 Images/s
216 batches: 1.0725960731506348 s
6444.17798 Images/s
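
Both timing loops measure the same thing, so they could share one helper. The following is a minimal sketch, not the notebook's own code: `time_batches` is a hypothetical name, and a stand-in generator is used so the block runs without TensorFlow.

```python
import time
from itertools import islice


def time_batches(batches, num_batches, batch_size):
    """Consume num_batches items from an iterable and return (seconds, images/s)."""
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(batches, num_batches))
    duration = time.perf_counter() - start
    # Guard against a zero duration on very fast iterables.
    return duration, batch_size * consumed / max(duration, 1e-9)


# Stand-in iterable of 1000 fake batches, each a list of 32 items.
fake_batches = ([0] * 32 for _ in range(1000))
seconds, images_per_s = time_batches(fake_batches, 216, 32)
print("{:.5f} s, {:0.1f} Images/s".format(seconds, images_per_s))
```

In the notebook, `time_batches(ImageGenerator, 216, 32)` and `time_batches(train_ds, 216, 32)` should apply the same measurement to both loaders, assuming each object can be iterated directly as it is in the cells above.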
