<a href="https://colab.research.google.com/github/shashanksrajak/neural-networks-from-zero/blob/main/hands-on/ch13-data-preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ch 13: Loading and Preprocessing Data

In this chapter we learn how to load and preprocess data using the `tensorflow` and `keras` APIs.

At the heart of this process lies the `tf.data` API, which provides many features for managing training data.

TensorFlow guide: https://www.tensorflow.org/guide/data

In [1]:
import tensorflow as tf

## tf.data API

In [2]:
X = tf.range(10)
X

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)>

In [11]:
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [13]:
for item in dataset:
  print(item.numpy())

0
1
2
3
4
5
6
7
8
9


NOTE: `tf.data` behaves like a streaming API: we can iterate over a dataset, but we cannot index or slice it.

In [5]:
dataset[0]

TypeError: '_TensorSliceDataset' object is not subscriptable
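
Since indexing is not available, one way to peek at individual elements is to iterate only as far as needed. The following is a minimal sketch, assuming eager execution (the TF 2.x default); the variable names `first` and `first_three` are illustrative and not part of the original notebook.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# next(iter(...)) pulls the first element without any indexing.
first = next(iter(dataset))

# take(n) limits the stream to n elements; as_numpy_iterator() yields NumPy values.
first_three = [int(x) for x in dataset.take(3).as_numpy_iterator()]

print(int(first.numpy()))  # 0
print(first_three)         # [0, 1, 2]
```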

## Chaining Transformations

Dataset methods do not modify the dataset they are called on; they return a new one. Store the result in a variable if you want to use it further.

In [14]:
dataset = dataset.repeat(3).batch(7)

for item in dataset:
  print(item)  # each item is now a batch of up to 7 elements, not a single scalar

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
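
To check that transformations return new datasets rather than changing the one they are called on, the sketch below builds a batched dataset from a base dataset and then confirms that the base still yields its original ten values. The names `base` and `batched` are illustrative. It also passes `drop_remainder=True`, an optional argument of `batch()`, as one possible way to discard a short final batch such as the `[8 9]` above.

```python
import tensorflow as tf

base = tf.data.Dataset.from_tensor_slices(tf.range(10))

# repeat() and batch() return new datasets; `base` itself is left untouched.
batched = base.repeat(3).batch(7, drop_remainder=True)

base_values = [int(x) for x in base.as_numpy_iterator()]
batch_sizes = [int(b.shape[0]) for b in batched]

print(base_values)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(batch_sizes)  # [7, 7, 7, 7]
```

With 30 elements and a batch size of 7, the last 2 elements do not fill a batch and are dropped.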


In [15]:
dataset = dataset.map(lambda x: x * 2)

for item in dataset:
  print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


When we only want the first few items of a dataset (here, the first few batches), we can use `take(n)`.

In [16]:
for item in dataset.take(2):
  print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)


## Shuffling the data

In [17]:
dataset = tf.data.Dataset.range(10).repeat(2)
dataset = dataset.shuffle(buffer_size=4, seed=42).batch(7)

for item in dataset:
  print(item)

tf.Tensor([1 4 2 3 5 0 6], shape=(7,), dtype=int64)
tf.Tensor([9 8 2 0 3 1 4], shape=(7,), dtype=int64)
tf.Tensor([5 7 9 6 7 8], shape=(6,), dtype=int64)
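
`shuffle()` keeps a buffer of `buffer_size` elements and draws each output at random from that buffer, so a small buffer only shuffles locally. The sketch below is a hedged illustration of that behavior, with illustrative variable names that are not from the original notebook. It checks two consequences: with `buffer_size=4`, the first element drawn from `range(10)` must be one of the first four inputs, and with `buffer_size=1` the order is unchanged. The exact shuffled order can differ between TensorFlow versions, so it is not shown.

```python
import tensorflow as tf

source = tf.data.Dataset.range(10)

# With a buffer of 4, the first output can only come from the first 4 inputs.
small_buffer = [
    int(x) for x in source.shuffle(buffer_size=4, seed=42).as_numpy_iterator()
]

# A buffer of 1 only ever holds a single element, so nothing is reordered.
no_shuffle = [int(x) for x in source.shuffle(buffer_size=1).as_numpy_iterator()]

# A buffer at least as large as the dataset lets any element come first.
full_buffer = [int(x) for x in source.shuffle(buffer_size=10).as_numpy_iterator()]

print(small_buffer[0] in {0, 1, 2, 3})           # True
print(no_shuffle == list(range(10)))             # True
print(sorted(full_buffer) == list(range(10)))    # True
```

For a thorough shuffle, the buffer should be at least as large as the dataset, at the cost of holding that many elements in memory.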
