# **Chapter 13: Efficient Data Pipelines Guide**

## 1. Introduction to TensorFlow Data API

The Data API provides tools to efficiently load, preprocess, and feed data to your models. Key benefits include:
- Handling datasets too large to fit in memory
- Optimized performance with prefetching and parallel processing
- Seamless integration with tf.keras

### Core Concepts:
- `tf.data.Dataset`: Represents a sequence of data items
- Transformations: Methods like `map()`, `batch()`, `shuffle()`
- Iteration: Process datasets in batches during training
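
As a minimal sketch of how these three concepts fit together, the snippet below creates a dataset, chains a few transformations, and iterates over the result. The doubling function, the seed, and the buffer and batch sizes are arbitrary choices for illustration.

```python
import tensorflow as tf

# A Dataset is a sequence of elements; here, the integers 0..9.
dataset = tf.data.Dataset.range(10)

# Transformations return new datasets, so they can be chained.
dataset = (
    dataset
    .map(lambda x: x * 2)               # element-wise transformation
    .shuffle(buffer_size=10, seed=42)   # randomize element order
    .batch(4)                           # group elements into batches
)

# Iteration yields one batch (a tensor) at a time.
batches = [batch.numpy() for batch in dataset]
for batch in batches:
    print(batch)
```

Because 10 is not a multiple of 4, the last batch holds only two elements; passing `drop_remainder=True` to `batch()` would discard it instead.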

## 2. Creating Datasets

### 2.1 From In-Memory Data
Create datasets from Python structures:

In [1]:
# Import TensorFlow and NumPy
import tensorflow as tf
import numpy as np

# From a NumPy array: each entry becomes one dataset element
data = np.array([1, 2, 3, 4, 5])
dataset = tf.data.Dataset.from_tensor_slices(data)

# From multiple arrays (features and labels): each element is a (features, label) pair
features = np.random.rand(100, 5)
labels = np.random.randint(0, 2, size=(100, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Inspect the dataset
for element in dataset.take(3):
    print(element)

(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([0.44015409, 0.60324627, 0.74666876, 0.16445556, 0.65270905])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([0.38414335, 0.96022546, 0.25301543, 0.37172385, 0.75361583])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([0.37657098, 0.0852118 , 0.6453096 , 0.83176716, 0.80311083])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>)
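
Printing elements is one way to inspect a dataset; another is to query its structure directly. The following sketch assumes the same randomly generated `features` and `labels` arrays as above and uses `element_spec` and `cardinality()` to check shapes and size without iterating.

```python
import numpy as np
import tensorflow as tf

features = np.random.rand(100, 5)
labels = np.random.randint(0, 2, size=(100, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# element_spec describes one element: a (features, label) tuple of TensorSpecs.
feature_spec, label_spec = dataset.element_spec
print(feature_spec)
print(label_spec)

# cardinality() reports the number of elements when it is statically known.
num_elements = int(dataset.cardinality())
print(num_elements)
```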


### 2.2 From Text Files
Read data from text files line by line:

In [2]:
# Create sample text files
import os
os.makedirs("data", exist_ok=True)
for i in range(3):
    with open(f"data/file_{i}.txt", "w") as f:
        f.write(f"Sample line 1 in file {i}\n")
        f.write(f"Sample line 2 in file {i}\n")

# Create dataset from text files
file_pattern = "data/file_*.txt"
dataset = tf.data.Dataset.list_files(file_pattern)

def read_file(file_path):
    # Each file becomes a dataset of its lines
    return tf.data.TextLineDataset(file_path)

# Interleave lines from up to three files at a time, reading in parallel
dataset = dataset.interleave(
    read_file,
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE
)

for line in dataset.take(4):
    print(line.numpy())

b'Sample line 1 in file 2'
b'Sample line 1 in file 1'
b'Sample line 1 in file 0'
b'Sample line 2 in file 2'
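
Raw lines usually need to be parsed before they are useful. The sketch below is one possible way to do that for the sample files: it writes them to a temporary directory (an assumption made so the example is self-contained), lists them with `shuffle=False` so the file order is fixed, and extracts the two numbers from each line. The helper name `parse_line` is hypothetical.

```python
import os
import tempfile
import tensorflow as tf

# Write three small sample files into a temporary directory.
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmp_dir, f"file_{i}.txt"), "w") as f:
        f.write(f"Sample line 1 in file {i}\n")
        f.write(f"Sample line 2 in file {i}\n")

# shuffle=False disables the random file order that list_files uses by default.
files = tf.data.Dataset.list_files(os.path.join(tmp_dir, "file_*.txt"), shuffle=False)

# With cycle_length=3 and the default block_length=1, interleave takes
# one line from each of the three files in turn.
lines = files.interleave(tf.data.TextLineDataset, cycle_length=3)

def parse_line(line):
    # Split on whitespace: ["Sample", "line", "<n>", "in", "file", "<i>"].
    parts = tf.strings.split(line)
    line_no = tf.strings.to_number(parts[2], out_type=tf.int32)
    file_no = tf.strings.to_number(parts[5], out_type=tf.int32)
    return line_no, file_no

parsed = [(int(line_no), int(file_no)) for line_no, file_no in lines.map(parse_line)]
print(parsed)
```

The first three parsed elements all come from line 1 of their files, which shows the round-robin behavior of `interleave`.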


## 3. Data Preprocessing

### 3.1 Using Dataset.map()
Apply transformations to each element:

In [4]:
import tensorflow as tf

# Create a dataset of the numbers 0 through 9
dataset = tf.data.Dataset.range(10)

# Preprocessing functions
def square(x):
    return x ** 2

def add_noise(x):
    x = tf.cast(x, tf.float32)  # Cast to float32 so the float noise can be added
    return x + tf.random.normal(shape=(), mean=0.0, stddev=0.1)

# Apply the transformations in order
dataset = dataset.map(square).map(add_noise)

# Show the first few elements
for element in dataset.take(5):
    print(element.numpy())


-0.1667198
0.97596556
3.9963362
9.033666
16.04558


### 3.2 Preprocessing Images
Complete pipeline for image data:

In [5]:
def preprocess_image(image_path, label):
    # Read and decode image
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)

    # Resize and normalize
    image = tf.image.resize(image, [224, 224])
    image = image / 255.0  # Normalize to [0,1]

    return image, label

# Example usage with dummy data
image_paths = ["path/to/image1.jpg", "path/to/image2.jpg"]  # Replace with actual paths
labels = [0, 1]

dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
dataset = dataset.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
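
The placeholder paths above do not point to real files, so iterating over that dataset would fail. As a runnable sketch, the version below generates a few random JPEG images in a temporary directory (synthetic data, assumed purely for illustration) and runs them through the same preprocessing function.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

def preprocess_image(image_path, label):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])  # returns float32
    return image / 255.0, label

# Create four random 64x48 RGB images as stand-ins for a real dataset.
tmp_dir = tempfile.mkdtemp()
image_paths, labels = [], []
for i in range(4):
    pixels = np.random.randint(0, 256, size=(64, 48, 3), dtype=np.uint8)
    path = os.path.join(tmp_dir, f"image_{i}.jpg")
    tf.io.write_file(path, tf.io.encode_jpeg(pixels))
    image_paths.append(path)
    labels.append(i % 2)

dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
dataset = dataset.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(2).prefetch(tf.data.AUTOTUNE)

images, batch_labels = next(iter(dataset))
print(images.shape, images.dtype, batch_labels.numpy())
```

Resizing before batching matters here: `batch()` needs every image in a batch to have the same shape.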

## 4. Performance Optimization

### 4.1 Essential Optimization Techniques

| Technique          | Method                 | Benefit                          |
|--------------------|------------------------|----------------------------------|
| Prefetching        | `.prefetch(buffer_size)` | Overlaps data prep and training  |
| Parallel Processing| `.map(..., num_parallel_calls)` | Uses multiple CPU cores       |
| Caching            | `.cache()`             | Avoids reprocessing              |
| Batching           | `.batch(batch_size)`   | Processes data in batches        |
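
To illustrate the caching row, here is a small sketch that counts how often a mapped function actually runs. The names `_expensive`, `expensive_map`, and `call_count` are hypothetical, and `tf.py_function` is used only so that a Python-side counter can observe each call.

```python
import tensorflow as tf

call_count = {"n": 0}

def _expensive(x):
    # Stand-in for a costly preprocessing step; counts its invocations.
    call_count["n"] += 1
    return x * x

def expensive_map(x):
    y = tf.py_function(func=_expensive, inp=[x], Tout=tf.int64)
    y.set_shape(())  # py_function loses static shape information
    return y

dataset = (
    tf.data.Dataset.range(8)
    .map(expensive_map)
    .cache()                      # store mapped elements after the first full pass
    .shuffle(8)                   # shuffle after caching so each epoch is reordered
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

for epoch in range(3):
    for batch in dataset:
        pass

print("map calls:", call_count["n"])
```

With the cache in place, the mapped function should run once per element during the first epoch and not again in later epochs. Placing `cache()` before `shuffle()` keeps the shuffling fresh in every epoch while still reusing the preprocessed elements.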

### 4.2 Complete Optimized Pipeline

In [6]:
# Example combining parallel file reads, shuffling, batching, and prefetching
def create_optimized_pipeline(file_pattern, batch_size=32):
    # List files
    dataset = tf.data.Dataset.list_files(file_pattern)

    # Read files in parallel
    dataset = dataset.interleave(
        tf.data.TextLineDataset,
        cycle_length=tf.data.AUTOTUNE,
        num_parallel_calls=tf.data.AUTOTUNE
    )

    # Shuffle and batch
    dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)

    # Prefetch
    dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

    return dataset

# Usage
optimized_dataset = create_optimized_pipeline("data/*.txt")

## 5. TFRecord Format

### 5.1 Creating TFRecord Files
TFRecord is an efficient binary file format that TensorFlow uses to store and process large-scale data. Each record written below holds one serialized `tf.train.Example`:

In [7]:
def create_tfrecord_example(feature, label):
    feature_dict = {
        'feature': tf.train.Feature(float_list=tf.train.FloatList(value=feature)),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
    return example.SerializeToString()

# Write the serialized examples to a TFRecord file
with tf.io.TFRecordWriter("data/sample.tfrecord") as writer:
    for i in range(10):
        feature = np.random.rand(5).astype(np.float32)
        label = i % 2
        example = create_tfrecord_example(feature, label)
        writer.write(example)
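
One way to check that a record was written as intended is to read the raw bytes back and decode them with the protobuf API. The sketch below assumes a single hand-picked record and a temporary file path, both chosen only for illustration.

```python
import os
import tempfile
import numpy as np
import tensorflow as tf

def create_tfrecord_example(feature, label):
    feature_dict = {
        "feature": tf.train.Feature(float_list=tf.train.FloatList(value=feature)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature_dict))
    return example.SerializeToString()

# Write one known record to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
feature = np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float32)
with tf.io.TFRecordWriter(path) as writer:
    writer.write(create_tfrecord_example(feature, 1))

# Read the raw bytes back and decode them into a tf.train.Example.
raw_record = next(iter(tf.data.TFRecordDataset(path))).numpy()
example = tf.train.Example.FromString(raw_record)

decoded_feature = list(example.features.feature["feature"].float_list.value)
decoded_label = example.features.feature["label"].int64_list.value[0]
print(decoded_feature, decoded_label)
```

Decoding with `tf.train.Example.FromString` runs in plain Python, which makes it handy for debugging; inside a `tf.data` pipeline, the `tf.io.parse_*` functions are the usual choice.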

### 5.2 Reading TFRecord Files
Parse TFRecord data back into a usable format:

In [8]:
feature_description = {
    'feature': tf.io.FixedLenFeature([5], tf.float32),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_tfrecord(example_proto):
    return tf.io.parse_single_example(example_proto, feature_description)

dataset = tf.data.TFRecordDataset("data/sample.tfrecord")
dataset = dataset.map(parse_tfrecord)

for parsed_record in dataset.take(3):
    print(parsed_record)

{'feature': <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.10959806, 0.5421599 , 0.45390922, 0.37752235, 0.5624958 ],
      dtype=float32)>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>}
{'feature': <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.60083044, 0.40484416, 0.96339387, 0.1968151 , 0.10927536],
      dtype=float32)>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=1>}
{'feature': <tf.Tensor: shape=(5,), dtype=float32, numpy=
array([0.9359239 , 0.6607104 , 0.1501723 , 0.28888416, 0.64935267],
      dtype=float32)>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>}


## 6. Keras Preprocessing Layers

### 6.1 Built-in Preprocessing
Keras provides preprocessing layers for common steps such as numeric normalization and categorical encoding:

In [9]:
# Import the preprocessing layers and NumPy
from tensorflow.keras.layers import Normalization, StringLookup
import numpy as np

# Numeric feature normalization
data = np.random.rand(100, 1) * 100
norm_layer = Normalization()
norm_layer.adapt(data)  # Compute the mean and variance of the data
normalized_data = norm_layer(data)
print("Normalized mean:", np.mean(normalized_data))

# Categorical feature encoding (by default, index 0 is reserved for out-of-vocabulary values)
categories = ["cat", "dog", "bird"]
lookup_layer = StringLookup(vocabulary=categories)
encoded = lookup_layer(["dog", "cat", "bird", "dog"])
print("Encoded categories:", encoded.numpy())

Normalized mean: 4.053116e-08
Encoded categories: [2 1 3 2]
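
Preprocessing layers can also be applied inside a `tf.data` pipeline rather than called on a whole array. The sketch below adapts a `Normalization` layer once and then applies it per batch in `map()`; the random data, labels, and batch size are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

data = (np.random.rand(100, 1) * 100).astype("float32")
labels = np.random.randint(0, 2, size=(100,))

# Learn the mean and variance once, before building the pipeline.
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(data)

dataset = (
    tf.data.Dataset.from_tensor_slices((data, labels))
    .batch(25)
    .map(lambda x, y: (norm_layer(x), y), num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

normalized = np.concatenate([x.numpy() for x, _ in dataset])
print(normalized.mean(), normalized.std())
```

Applying the layer in the pipeline moves the work to the CPU input stage, where it can overlap with training. Including the layer in the model instead keeps preprocessing bundled with the saved model, which can simplify deployment.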


## 7. Exercises

1. Create a pipeline that reads CSV files, preprocesses numeric and categorical columns, and feeds to a model
2. Benchmark the performance difference between prefetching vs no prefetching
3. Implement a custom preprocessing layer for text data
4. Convert an image dataset to TFRecord format and create a loading pipeline

## 8. Key Takeaways

- The Data API provides flexible tools for efficient data loading
- Proper preprocessing is crucial for model performance
- The TFRecord format is well suited to large datasets
- Keras preprocessing layers integrate seamlessly with models
- Optimization techniques can significantly improve training speed