## 1. Introduction

### Why Do Data Pipelines Matter?

Deep learning needs **large amounts of data**, which brings several challenges:
- The data may not fit in memory
- Data loading can become the bottleneck
- Preprocessing has to be efficient
- The GPU sits idle while it waits for data

### What TensorFlow Offers:

1. **tf.data API**: efficient pipelines for loading and preprocessing
2. **TFRecord**: a binary format well suited to large datasets
3. **Keras Preprocessing Layers**: preprocessing built into the model
4. **TensorFlow Datasets (TFDS)**: a library of ready-to-use datasets

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import os

# Set random seed
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")


TensorFlow version: 2.20.0


## 2. The tf.data API

`tf.data.Dataset` is an abstraction for a sequence of elements that can be iterated over.

### 2.1 Creating a Dataset

In [2]:
# From a tensor
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])

print("Dataset from a tensor:")
for item in dataset:
    print(item.numpy())

Dataset from a tensor:
1
2
3
4
5


In [3]:
# From multiple tensors (features and labels)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

dataset = tf.data.Dataset.from_tensor_slices((X, y))

print("Dataset with features and labels:")
for features, label in dataset:
    print(f"Features: {features.numpy()}, Label: {label.numpy()}")

Dataset with features and labels:
Features: [1 2], Label: 0
Features: [3 4], Label: 1
Features: [5 6], Label: 0
Features: [7 8], Label: 1


In [4]:
# From a dictionary (for named features)
data_dict = {
    'feature1': [1, 2, 3, 4],
    'feature2': [10, 20, 30, 40],
    'label': [0, 1, 0, 1]
}

dataset = tf.data.Dataset.from_tensor_slices(data_dict)

print("Dataset from a dictionary:")
for item in dataset:
    print(item)

Dataset from a dictionary:
{'feature1': <tf.Tensor: shape=(), dtype=int32, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=int32, numpy=10>, 'label': <tf.Tensor: shape=(), dtype=int32, numpy=0>}
{'feature1': <tf.Tensor: shape=(), dtype=int32, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=int32, numpy=20>, 'label': <tf.Tensor: shape=(), dtype=int32, numpy=1>}
{'feature1': <tf.Tensor: shape=(), dtype=int32, numpy=3>, 'feature2': <tf.Tensor: shape=(), dtype=int32, numpy=30>, 'label': <tf.Tensor: shape=(), dtype=int32, numpy=0>}
{'feature1': <tf.Tensor: shape=(), dtype=int32, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=int32, numpy=40>, 'label': <tf.Tensor: shape=(), dtype=int32, numpy=1>}


In [5]:
# Dataset from a range
dataset = tf.data.Dataset.range(10)
print(f"Range dataset: {list(dataset.as_numpy_iterator())}")

Range dataset: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]


### 2.2 Chaining Transformations

In [6]:
# Method chaining
dataset = tf.data.Dataset.range(10)

dataset = dataset.repeat(2)      # Repeat 2x
dataset = dataset.batch(5)       # Batch size 5

print("Chained transformations:")
for batch in dataset:
    print(batch.numpy())

Chained transformations:
[0 1 2 3 4]
[5 6 7 8 9]
[0 1 2 3 4]
[5 6 7 8 9]
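
The order of chained calls changes the result. The sketch below is an illustrative addition, not part of the original notebook. It compares `repeat` followed by `batch` with `batch` followed by `repeat`, using a batch size of 3 that does not divide the 10 elements evenly.

```python
import tensorflow as tf

base = tf.data.Dataset.range(10)

# repeat -> batch: the 20 repeated elements are batched as one stream,
# so a batch can straddle the boundary between two passes over the data.
repeat_then_batch = [b.numpy().tolist() for b in base.repeat(2).batch(3)]

# batch -> repeat: each pass is batched on its own, so every pass ends
# with a short batch that holds the single leftover element.
batch_then_repeat = [b.numpy().tolist() for b in base.batch(3).repeat(2)]

print(len(repeat_then_batch), repeat_then_batch[3])  # 7 [9, 0, 1]
print(len(batch_then_repeat), batch_then_repeat[3])  # 8 [9]
```

If batches should never mix elements from two passes, `batch` before `repeat` is the safer order.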


## 3. Data Transformations

Common transformations applied to a dataset.

### 3.1 Map Transformation

In [7]:
# Map: apply a function to every element
dataset = tf.data.Dataset.range(10)

# Square every element
dataset = dataset.map(lambda x: x ** 2)

print(f"Squared: {list(dataset.as_numpy_iterator())}")

Squared: [np.int64(0), np.int64(1), np.int64(4), np.int64(9), np.int64(16), np.int64(25), np.int64(36), np.int64(49), np.int64(64), np.int64(81)]


In [8]:
# Map with a more complex function
def preprocess(features, label):
    # Normalize features
    features = tf.cast(features, tf.float32) / 10.0
    # One-hot encode label
    label = tf.one_hot(label, depth=2)
    return features, label

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 1, 0, 1])

dataset = tf.data.Dataset.from_tensor_slices((X, y))
dataset = dataset.map(preprocess)

print("After preprocessing:")
for features, label in dataset:
    print(f"Features: {features.numpy()}, Label: {label.numpy()}")

After preprocessing:
Features: [0.1 0.2], Label: [1. 0.]
Features: [0.3 0.4], Label: [0. 1.]
Features: [0.5 0.6], Label: [1. 0.]
Features: [0.7 0.8], Label: [0. 1.]


In [9]:
# Parallel map for speedup
dataset = tf.data.Dataset.range(1000)

# num_parallel_calls controls parallelism
dataset = dataset.map(
    lambda x: x ** 2,
    num_parallel_calls=tf.data.AUTOTUNE  # Let tf.data choose the level of parallelism
)

print(f"Parallel map done! Dataset size: {len(list(dataset))}")

Parallel map done! Dataset size: 1000


### 3.2 Filter Transformation

In [10]:
# Filter: keep the elements that satisfy a condition
dataset = tf.data.Dataset.range(20)

# Keep only even numbers
dataset = dataset.filter(lambda x: x % 2 == 0)

print(f"Even numbers: {list(dataset.as_numpy_iterator())}")

Even numbers: [np.int64(0), np.int64(2), np.int64(4), np.int64(6), np.int64(8), np.int64(10), np.int64(12), np.int64(14), np.int64(16), np.int64(18)]


### 3.3 Take and Skip

In [11]:
dataset = tf.data.Dataset.range(10)

# Take: keep the first n elements
print(f"Take 3: {list(dataset.take(3).as_numpy_iterator())}")

# Skip: skip the first n elements
print(f"Skip 7: {list(dataset.skip(7).as_numpy_iterator())}")

Take 3: [np.int64(0), np.int64(1), np.int64(2)]
Skip 7: [np.int64(7), np.int64(8), np.int64(9)]


### 3.4 Flat Map

In [12]:
# Flat map: map, then flatten
# Use uniform-length arrays for from_tensor_slices
dataset = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6]])

# Flatten nested structure
dataset = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))

print(f"Flattened: {list(dataset.as_numpy_iterator())}")

Flattened: [np.int32(1), np.int32(2), np.int32(3), np.int32(4), np.int32(5), np.int32(6)]


## 4. Shuffling, Batching, and Prefetching

Three operations that matter for efficient training.

### 4.1 Shuffling

In [13]:
# Shuffle dataset
dataset = tf.data.Dataset.range(10)

# buffer_size: number of elements held in the buffer that each output is sampled from
dataset = dataset.shuffle(buffer_size=5, seed=42)

print(f"Shuffled (buffer=5): {list(dataset.as_numpy_iterator())}")

Shuffled (buffer=5): [np.int64(0), np.int64(1), np.int64(6), np.int64(5), np.int64(7), np.int64(3), np.int64(4), np.int64(8), np.int64(2), np.int64(9)]


In [14]:
# Effect of the buffer size
dataset = tf.data.Dataset.range(10)

print("Buffer size comparison:")
for buffer_size in [1, 3, 10]:
    shuffled = dataset.shuffle(buffer_size=buffer_size, seed=42)
    print(f"  Buffer {buffer_size}: {list(shuffled.as_numpy_iterator())}")

Buffer size comparison:
  Buffer 1: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9)]
  Buffer 3: [np.int64(1), np.int64(3), np.int64(0), np.int64(4), np.int64(2), np.int64(5), np.int64(6), np.int64(8), np.int64(7), np.int64(9)]
  Buffer 10: [np.int64(5), np.int64(3), np.int64(7), np.int64(1), np.int64(4), np.int64(0), np.int64(2), np.int64(8), np.int64(6), np.int64(9)]
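
The comparison above follows from how a shuffle buffer works: the buffer is filled with the first `buffer_size` elements, and each output is drawn at random from the buffer and replaced by the next input. The pure-Python function below is a simplified illustration of that idea under these assumptions. `buffered_shuffle` is a hypothetical helper, not TensorFlow's implementation, and it will not reproduce TensorFlow's exact orderings.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Illustrative shuffle buffer (hypothetical helper, not TensorFlow code)."""
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        output.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output

for size in [1, 3, 10]:
    print(size, buffered_shuffle(range(10), size, seed=42))
```

With a buffer of 1 there is only one candidate at every step, so the order is unchanged. With a buffer of size `k`, the element emitted at position `i` can only come from the first `i + k` inputs, which is why a small buffer only shuffles locally and a buffer at least as large as the dataset gives a full shuffle.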


### 4.2 Batching

In [15]:
# Batching
dataset = tf.data.Dataset.range(12)

# Batch size 4
batched = dataset.batch(4)

print("Batched (size=4):")
for batch in batched:
    print(f"  {batch.numpy()}")

Batched (size=4):
  [0 1 2 3]
  [4 5 6 7]
  [ 8  9 10 11]


In [16]:
# Drop remainder (for a consistent batch size)
dataset = tf.data.Dataset.range(10)

batched = dataset.batch(4, drop_remainder=True)

print("Batched with drop_remainder=True:")
for batch in batched:
    print(f"  {batch.numpy()} (size: {len(batch)})")

Batched with drop_remainder=True:
  [0 1 2 3] (size: 4)
  [4 5 6 7] (size: 4)
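
Besides dropping the short final batch, `drop_remainder=True` also changes the static shape that the dataset reports. The following sketch is an illustrative addition that inspects `element_spec` for both settings.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Without drop_remainder the last batch may be shorter, so the batch
# dimension is unknown (None) in the static shape.
variable_spec = dataset.batch(4).element_spec

# With drop_remainder=True every batch has exactly 4 elements.
fixed_spec = dataset.batch(4, drop_remainder=True).element_spec

print(variable_spec.shape)  # (None,)
print(fixed_spec.shape)     # (4,)
```

A fully known batch dimension can matter for code that requires static shapes, at the cost of discarding up to `batch_size - 1` elements per pass.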


### 4.3 Prefetching

In [17]:
# Prefetch: prepare next batch while training on current batch
dataset = tf.data.Dataset.range(1000)
dataset = dataset.shuffle(100)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Buffer size chosen automatically

print("Pipeline with prefetch is ready!")
print(f"Number of batches: {len(list(dataset))}")

Pipeline with prefetch is ready!
Number of batches: 32


### 4.4 Optimal Data Pipeline

In [18]:
# Pipeline visualization
print("""
Optimal Data Pipeline:

    ┌─────────────┐
    │   Dataset   │  <- Raw data
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │   Shuffle   │  <- Randomize order (large buffer)
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │    Map      │  <- Preprocessing (parallel)
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │   Batch     │  <- Create batches
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │  Prefetch   │  <- Overlap data loading & training
    └──────┬──────┘
           │
           ▼
       Training
""")


Optimal Data Pipeline:

    ┌─────────────┐
    │   Dataset   │  <- Raw data
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │   Shuffle   │  <- Randomize order (large buffer)
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │    Map      │  <- Preprocessing (parallel)
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │   Batch     │  <- Create batches
    └──────┬──────┘
           │
    ┌──────▼──────┐
    │  Prefetch   │  <- Overlap data loading & training
    └──────┬──────┘
           │
           ▼
       Training



In [19]:
# Complete optimal pipeline
def create_optimal_pipeline(X, y, batch_size=32, shuffle_buffer=1000):
    """
    Build a data pipeline optimized for training
    """
    dataset = tf.data.Dataset.from_tensor_slices((X, y))

    # Shuffle
    dataset = dataset.shuffle(buffer_size=shuffle_buffer)

    # Map (preprocessing) with parallelism
    dataset = dataset.map(
        lambda x, y: (tf.cast(x, tf.float32), y),
        num_parallel_calls=tf.data.AUTOTUNE
    )
    
    # Batch
    dataset = dataset.batch(batch_size)
    
    # Prefetch
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    return dataset

# Example
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, 1000)

train_dataset = create_optimal_pipeline(X, y, batch_size=32)
print(f"Optimal pipeline created! Batches: {len(list(train_dataset))}")

Optimal pipeline created! Batches: 32


## 5. Reading Data from Files

Reading data from various file formats.

### 5.1 CSV Files

In [20]:
# Create a sample CSV file
csv_data = """feature1,feature2,label
1.0,2.0,0
3.0,4.0,1
5.0,6.0,0
7.0,8.0,1
9.0,10.0,0"""

# Save to file
with open('sample_data.csv', 'w') as f:
    f.write(csv_data)

print("Sample CSV file created!")

Sample CSV file created!


In [21]:
# Read CSV with tf.data
def parse_csv_line(line):
    # Define column types
    defaults = [0.0, 0.0, 0]  # feature1, feature2, label
    parsed = tf.io.decode_csv(line, defaults)
    features = tf.stack(parsed[:-1])  # All except last
    label = parsed[-1]  # Last column
    return features, label

# Create dataset from CSV
dataset = tf.data.TextLineDataset('sample_data.csv')
dataset = dataset.skip(1)  # Skip header
dataset = dataset.map(parse_csv_line)

print("CSV data loaded:")
for features, label in dataset:
    print(f"  Features: {features.numpy()}, Label: {label.numpy()}")

CSV data loaded:
  Features: [1. 2.], Label: 0
  Features: [3. 4.], Label: 1
  Features: [5. 6.], Label: 0
  Features: [7. 8.], Label: 1
  Features: [ 9. 10.], Label: 0


In [22]:
# Multiple CSV files
# Create several files
for i in range(3):
    with open(f'data_part{i}.csv', 'w') as f:
        f.write(f"feature,label\n{i*10},{i}\n{i*10+1},{i}")

# Read multiple files
file_pattern = 'data_part*.csv'
filenames = tf.io.gfile.glob(file_pattern)
print(f"Found files: {filenames}")

# Interleave for parallel reading
dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=3,  # Number of files to read in parallel
    num_parallel_calls=tf.data.AUTOTUNE
)

print("\nInterleaved data:")
for line in dataset:
    print(f"  {line.numpy().decode()}")

Found files: ['.\\data_part0.csv', '.\\data_part1.csv', '.\\data_part2.csv']

Interleaved data:
  10,1
  0,0
  20,2
  11,1
  1,0
  21,2
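
The output alternates between files because `interleave` takes one element from each open file in turn. The pure-Python sketch below is a simplified model of that round-robin behavior, assuming a block length of 1. `interleave` here is a hypothetical helper, not TensorFlow's implementation, and the `files` list mimics the file order seen in the output above.

```python
from itertools import islice

def interleave(sources, cycle_length):
    """Round-robin over up to `cycle_length` open sources (illustrative only)."""
    pending = iter(sources)
    # Open the first `cycle_length` sources.
    slots = [iter(s) for s in islice(pending, cycle_length)]
    output = []
    while slots:
        next_round = []
        for it in slots:
            try:
                output.append(next(it))
                next_round.append(it)
            except StopIteration:
                # Swap an exhausted source for the next unopened one, if any.
                replacement = next(pending, None)
                if replacement is not None:
                    next_round.append(iter(replacement))
        slots = next_round
    return output

# Each inner list stands in for the data lines of one CSV file.
files = [["10,1", "11,1"], ["0,0", "1,0"], ["20,2", "21,2"]]
print(interleave(files, cycle_length=3))
print(interleave(files, cycle_length=1))
```

With `cycle_length=3` the lines alternate across all three files. With `cycle_length=1` the files are read one after another.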


### 5.2 Image Files

In [23]:
# Reading image files (example)
def load_and_preprocess_image(filepath, label):
    # Read file
    image = tf.io.read_file(filepath)
    # Decode (supports JPEG, PNG, etc.); expand_animations=False keeps the
    # tensor 3-D with a known rank so tf.image.resize works inside Dataset.map
    image = tf.image.decode_image(image, channels=3, expand_animations=False)
    # Resize
    image = tf.image.resize(image, [224, 224])
    # Normalize
    image = image / 255.0
    return image, label

# Example usage:
# filepaths = ['path/to/image1.jpg', 'path/to/image2.jpg']
# labels = [0, 1]
# dataset = tf.data.Dataset.from_tensor_slices((filepaths, labels))
# dataset = dataset.map(load_and_preprocess_image)

print("Image loading function defined!")

Image loading function defined!
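
To try an image loader end to end without real images, the sketch below writes two small random PNG files to a temporary directory and runs them through a `tf.data` pipeline. The file names and pixel data are placeholders. It passes `expand_animations=False` to `tf.image.decode_image` so that the decoded tensor has a known rank when the function is traced inside `Dataset.map`.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def load_and_preprocess_image(filepath, label):
    image = tf.io.read_file(filepath)
    # expand_animations=False yields a 3-D tensor with a known rank.
    image = tf.image.decode_image(image, channels=3, expand_animations=False)
    image = tf.image.resize(image, [224, 224])
    return image / 255.0, label

# Write two small random PNG files (placeholder data, not real images).
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
filepaths = []
for i in range(2):
    pixels = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)
    path = os.path.join(tmp_dir, f"image{i}.png")
    tf.io.write_file(path, tf.io.encode_png(pixels))
    filepaths.append(path)
labels = [0, 1]

dataset = tf.data.Dataset.from_tensor_slices((filepaths, labels))
dataset = dataset.map(load_and_preprocess_image,
                      num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(2)

images, batch_labels = next(iter(dataset))
print(images.shape, batch_labels.numpy())
```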


## 6. TFRecord Format

**TFRecord** is TensorFlow's binary format, well suited to large datasets.

### 6.1 Creating a TFRecord File

In [24]:
# Helper functions for TFRecord
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_array_feature(value):
    """Returns a float_list from a numpy array."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

print("TFRecord helper functions defined!")

TFRecord helper functions defined!


In [25]:
# Create a TFRecord file
def serialize_example(features, label):
    """Serialize one example to the TFRecord format"""
    feature_dict = {
        'features': _float_array_feature(features),
        'label': _int64_feature(label)
    }
    example_proto = tf.train.Example(
        features=tf.train.Features(feature=feature_dict)
    )
    return example_proto.SerializeToString()

# Generate sample data
n_samples = 100
X = np.random.randn(n_samples, 10).astype(np.float32)
y = np.random.randint(0, 2, n_samples)

# Write to TFRecord
tfrecord_file = 'sample_data.tfrecord'
with tf.io.TFRecordWriter(tfrecord_file) as writer:
    for features, label in zip(X, y):
        example = serialize_example(features, label)
        writer.write(example)

print(f"TFRecord file created: {tfrecord_file}")
print(f"File size: {os.path.getsize(tfrecord_file)} bytes")

TFRecord file created: sample_data.tfrecord
File size: 9200 bytes


### 6.2 Reading a TFRecord File

In [26]:
# Parse function for TFRecord
def parse_tfrecord(serialized_example):
    feature_description = {
        'features': tf.io.FixedLenFeature([10], tf.float32),
        'label': tf.io.FixedLenFeature([], tf.int64)
    }
    example = tf.io.parse_single_example(serialized_example, feature_description)
    return example['features'], example['label']

# Read TFRecord
dataset = tf.data.TFRecordDataset(tfrecord_file)
dataset = dataset.map(parse_tfrecord)

# Show first few examples
print("TFRecord data loaded:")
for features, label in dataset.take(3):
    print(f"  Features shape: {features.shape}, Label: {label.numpy()}")

TFRecord data loaded:
  Features shape: (10,), Label: 1
  Features shape: (10,), Label: 1
  Features shape: (10,), Label: 1
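
When the schema of a TFRecord file is unknown, it can help to decode a single record into a `tf.train.Example` protocol buffer and look at its keys before writing a `feature_description`. The sketch below is an illustrative addition. It builds one record in memory with made-up values and decodes it again, and the same `FromString` call can be applied to the bytes of a record read from a file.

```python
import tensorflow as tf

# Build and serialize one record with 'features' and 'label' fields
# (the values are made up for this illustration).
example = tf.train.Example(features=tf.train.Features(feature={
    "features": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.5, 1.5, 2.5])),
    "label": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[1])),
}))
serialized = example.SerializeToString()

# Decode the raw bytes back into a protocol buffer for inspection.
decoded = tf.train.Example.FromString(serialized)
feature_map = decoded.features.feature
keys = sorted(feature_map.keys())
values = list(feature_map["features"].float_list.value)
label = feature_map["label"].int64_list.value[0]

print(keys, values, label)
```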


In [27]:
# Compressed TFRecord
compressed_file = 'sample_data.tfrecord.gz'

options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter(compressed_file, options=options) as writer:
    for features, label in zip(X[:10], y[:10]):
        example = serialize_example(features, label)
        writer.write(example)

print(f"Compressed TFRecord: {os.path.getsize(compressed_file)} bytes")

Compressed TFRecord: 553 bytes
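
Reading a compressed file back requires passing the matching `compression_type` to `tf.data.TFRecordDataset`, since the reader is not expected to infer it from the `.gz` suffix. The round trip below is a minimal sketch that uses a temporary file and raw byte payloads as placeholder data.

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord.gz")

# Write three raw byte records with GZIP compression.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options=options) as writer:
    for payload in [b"first", b"second", b"third"]:
        writer.write(payload)

# Tell the reader which compression was used when writing.
dataset = tf.data.TFRecordDataset(path, compression_type="GZIP")
records = [record.numpy() for record in dataset]
print(records)
```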


## 7. Preprocessing Features

Preprocessing different types of features.

### 7.1 Numerical Features

In [28]:
# Standardization
def standardize(features, mean, std):
    return (features - mean) / std

# Normalization (min-max)
def normalize(features, min_val, max_val):
    return (features - min_val) / (max_val - min_val)

# Example
data = tf.constant([1.0, 5.0, 10.0, 15.0, 20.0])

mean = tf.reduce_mean(data)
std = tf.math.reduce_std(data)
standardized = standardize(data, mean, std)

print(f"Original: {data.numpy()}")
print(f"Standardized: {standardized.numpy()}")

Original: [ 1.  5. 10. 15. 20.]
Standardized: [-1.3541131  -0.7653683  -0.02943721  0.70649385  1.4424249 ]
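
The `normalize` helper above is defined but not exercised. The NumPy sketch below is an illustrative addition that applies min-max scaling to the same five values and cross-checks the statistics behind the standardized output: the mean is 10.2 and the population variance is 46.16.

```python
import numpy as np

def normalize(features, min_val, max_val):
    """Min-max scaling to the range [0, 1]."""
    return (features - min_val) / (max_val - min_val)

data = np.array([1.0, 5.0, 10.0, 15.0, 20.0])

# (x - 1) / (20 - 1): the smallest value maps to 0, the largest to 1.
scaled = normalize(data, data.min(), data.max())
print(scaled)

# Population statistics (divide by n, not n - 1).
mean = data.mean()
variance = data.var()
standardized = (data - mean) / np.sqrt(variance)
print(mean, variance, standardized)
```

Min-max scaling bounds the result to a fixed range, while standardization centers it at zero with unit variance.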


### 7.2 Categorical Features

In [29]:
# One-hot encoding
categories = tf.constant([0, 1, 2, 1, 0])
one_hot = tf.one_hot(categories, depth=3)

print("One-hot encoding:")
print(one_hot.numpy())

One-hot encoding:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [30]:
# Vocabulary lookup (string to integer)
vocab = ['cat', 'dog', 'bird']
lookup_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=vocab,
        values=tf.range(len(vocab))
    ),
    default_value=-1
)

# Lookup
words = tf.constant(['dog', 'cat', 'bird', 'fish'])
indices = lookup_table.lookup(words)

print(f"Words: {words.numpy()}")
print(f"Indices: {indices.numpy()}")  # 'fish' becomes -1 (unknown)

Words: [b'dog' b'cat' b'bird' b'fish']
Indices: [ 1  0  2 -1]


### 7.3 Embedding

In [31]:
# Embedding layer
vocab_size = 1000
embedding_dim = 16

embedding_layer = keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim
)

# Example: word indices -> embeddings
word_indices = tf.constant([[1, 2, 3], [4, 5, 6]])
embeddings = embedding_layer(word_indices)

print(f"Input shape: {word_indices.shape}")
print(f"Embedding shape: {embeddings.shape}")

Input shape: (2, 3)
Embedding shape: (2, 3, 16)


## 8. Keras Preprocessing Layers

Keras provides preprocessing layers that can be included in the model itself.

### 8.1 Numerical Preprocessing

In [32]:
# Normalization layer
normalizer = keras.layers.Normalization()

# Adapt to data (compute mean and variance)
data = np.array([[1.0], [5.0], [10.0], [15.0], [20.0]])
normalizer.adapt(data)

# Transform
normalized = normalizer(data)
print(f"Original: {data.flatten()}")
print(f"Normalized: {normalized.numpy().flatten()}")
print(f"Mean: {normalizer.mean.numpy()}, Variance: {normalizer.variance.numpy()}")

Original: [ 1.  5. 10. 15. 20.]
Normalized: [-1.354113   -0.7653682  -0.02943721  0.7064938   1.4424248 ]
Mean: [[10.2]], Variance: [[46.16]]


In [33]:
# Discretization (binning)
discretizer = keras.layers.Discretization(bin_boundaries=[0, 5, 10, 15])

data = tf.constant([[-5.0], [2.0], [7.0], [12.0], [20.0]])
discretized = discretizer(data)

print(f"Original: {data.numpy().flatten()}")
print(f"Discretized: {discretized.numpy().flatten()}")

Original: [-5.  2.  7. 12. 20.]
Discretized: [0 1 2 3 4]
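
Four boundaries split the number line into five bins, which is why the indices run from 0 to 4. The NumPy sketch below reproduces the same indices with `np.digitize`. It is an analogy added for illustration, and it assumes (without checking Keras internals) that values lying exactly on a boundary are treated the same way.

```python
import numpy as np

bin_boundaries = [0, 5, 10, 15]
values = np.array([-5.0, 2.0, 7.0, 12.0, 20.0])

# Bin i holds values with boundaries[i-1] <= x < boundaries[i];
# bin 0 is everything below the first boundary, bin 4 everything
# at or above the last one.
bins = np.digitize(values, bin_boundaries)
print(bins)  # [0 1 2 3 4]
```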


### 8.2 Categorical Preprocessing

In [34]:
# StringLookup - string to integer
string_lookup = keras.layers.StringLookup(
    vocabulary=['cat', 'dog', 'bird'],
    output_mode='int'
)

words = tf.constant(['dog', 'cat', 'bird', 'fish'])
indices = string_lookup(words)

print(f"Words: {words.numpy()}")
print(f"Indices: {indices.numpy()}")

Words: [b'dog' b'cat' b'bird' b'fish']
Indices: [2 1 3 0]
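
These indices are shifted by one compared with the `StaticHashTable` example because `StringLookup` reserves index 0 for out-of-vocabulary strings by default. The sketch below is an illustrative addition that inspects the layer's vocabulary to make the offset visible.

```python
import tensorflow as tf
from tensorflow import keras

lookup = keras.layers.StringLookup(vocabulary=["cat", "dog", "bird"])

# The stored vocabulary starts with the out-of-vocabulary token.
vocab = [str(token) for token in lookup.get_vocabulary()]
size = lookup.vocabulary_size()
print(vocab, size)

# Known words map to their 1-based position; unknown words map to 0.
indices = lookup(tf.constant(["dog", "fish"])).numpy().tolist()
print(indices)
```

Any layer that consumes these indices, such as an `Embedding`, should be sized with `vocabulary_size()` rather than the length of the original word list.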


In [35]:
# CategoryEncoding - integer to one-hot or multi-hot
encoder = keras.layers.CategoryEncoding(
    num_tokens=5,
    output_mode='one_hot'
)

categories = tf.constant([0, 1, 2, 3, 4])
encoded = encoder(categories)

print("One-hot encoded:")
print(encoded.numpy())

One-hot encoded:
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


In [36]:
# IntegerLookup with a vocabulary learned from data
integer_lookup = keras.layers.IntegerLookup()
integer_lookup.adapt([1, 2, 3, 1, 2, 1])  # Learn vocabulary

data = tf.constant([1, 2, 3, 4])  # 4 is unknown
result = integer_lookup(data)

print(f"Input: {data.numpy()}")
print(f"Encoded: {result.numpy()}")
print(f"Vocabulary: {integer_lookup.get_vocabulary()}")

Input: [1 2 3 4]
Encoded: [1 2 3 0]
Vocabulary: [-1, np.int64(1), np.int64(2), np.int64(3)]


### 8.3 Text Preprocessing

In [37]:
# TextVectorization layer
text_vectorizer = keras.layers.TextVectorization(
    max_tokens=100,
    output_mode='int',
    output_sequence_length=10
)

# Adapt to text data
texts = [
    "TensorFlow is great for deep learning",
    "Keras makes neural networks easy",
    "Deep learning is the future"
]
text_vectorizer.adapt(texts)

# Vectorize
vectorized = text_vectorizer(texts)

print("Vocabulary (first 20):")
print(text_vectorizer.get_vocabulary()[:20])
print(f"\nVectorized shape: {vectorized.shape}")
print(f"First text vectorized: {vectorized[0].numpy()}")

Vocabulary (first 20):
['', '[UNK]', np.str_('learning'), np.str_('is'), np.str_('deep'), np.str_('the'), np.str_('tensorflow'), np.str_('neural'), np.str_('networks'), np.str_('makes'), np.str_('keras'), np.str_('great'), np.str_('future'), np.str_('for'), np.str_('easy')]

Vectorized shape: (3, 10)
First text vectorized: [ 6  3 11 13  4  2  0  0  0  0]


### 8.4 Image Preprocessing

In [38]:
# Image preprocessing layers
image_preprocess = keras.Sequential([
    keras.layers.Resizing(224, 224),
    keras.layers.Rescaling(1./255)
])

# Data augmentation layers
data_augmentation = keras.Sequential([
    keras.layers.RandomFlip('horizontal'),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
    keras.layers.RandomContrast(0.1)
])

print("Image preprocessing layers:")
print("- Resizing: Resize images to fixed size")
print("- Rescaling: Normalize pixel values")
print("- RandomFlip: Random horizontal/vertical flip")
print("- RandomRotation: Random rotation")
print("- RandomZoom: Random zoom in/out")
print("- RandomContrast: Random contrast adjustment")

Image preprocessing layers:
- Resizing: Resize images to fixed size
- Rescaling: Normalize pixel values
- RandomFlip: Random horizontal/vertical flip
- RandomRotation: Random rotation
- RandomZoom: Random zoom in/out
- RandomContrast: Random contrast adjustment


### 8.5 Model with Preprocessing Layers

In [39]:
# Model with integrated preprocessing layers
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42
)

# Create normalizer
normalizer = keras.layers.Normalization()
normalizer.adapt(X_train)

# Model with preprocessing built in
model = keras.Sequential([
    normalizer,  # Preprocessing layer
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dense(30, activation='relu'),
    keras.layers.Dense(1)
])

model.compile(loss='mse', optimizer='adam')

# Training - no manual preprocessing needed!
history = model.fit(
    X_train, y_train,
    epochs=5,
    validation_split=0.2,
    verbose=1
)

print("\nModel with preprocessing layers finished training!")

Epoch 1/5
387/387 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 1.5291 - val_loss: 0.6027
Epoch 2/5
387/387 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.4939 - val_loss: 0.4669
Epoch 3/5
387/387 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.4210 - val_loss: 0.4378
Epoch 4/5
387/387 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.3932 - val_loss: 0.4183
Epoch 5/5
387/387 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.3785 - val_loss: 0.4072

Model with preprocessing layers finished training!


## 9. TensorFlow Datasets (TFDS)

A library for accessing ready-to-use datasets.

In [40]:
# Install TFDS if it is not installed yet
%pip install tensorflow-datasets

import tensorflow_datasets as tfds
print(f"TFDS version: {tfds.__version__}")

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
TFDS version: 4.9.9


In [41]:
# List available datasets
try:
    import tensorflow_datasets as tfds
    
    # A few popular datasets
    popular_datasets = [
        'mnist',
        'fashion_mnist',
        'cifar10',
        'cifar100',
        'imdb_reviews',
        'coco',
        'imagenet2012'
    ]
    
    print("Popular datasets in TFDS:")
    for ds in popular_datasets:
        print(f"  - {ds}")
except ImportError:
    print("TFDS not available")

Popular datasets in TFDS:
  - mnist
  - fashion_mnist
  - cifar10
  - cifar100
  - imdb_reviews
  - coco
  - imagenet2012


In [42]:
# Load a dataset with TFDS
try:
    import tensorflow_datasets as tfds
    
    # Load MNIST
    dataset, info = tfds.load(
        'mnist',
        split='train',
        with_info=True,
        as_supervised=True  # Returns (image, label) tuple
    )
    
    print(f"Dataset: {info.name}")
    print(f"Description: {info.description[:100]}...")
    print(f"Features: {info.features}")
    print(f"Total examples: {info.splits['train'].num_examples}")
    
except Exception as e:
    print(f"Could not load TFDS: {e}")

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\HP Pavilion 15\tensorflow_datasets\mnist\3.0.1...

Dataset mnist downloaded and prepared to C:\Users\HP Pavilion 15\tensorflow_datasets\mnist\3.0.1. Subsequent calls will reuse this data.
Dataset: mnist
Description: The MNIST database of handwritten digits....
Features: FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=uint8),
    'label': ClassLabel(shape=(), dtype=int64, num_classes=10),
})
Total examples: 60000




In [43]:
# Example pipeline with TFDS
try:
    import tensorflow_datasets as tfds
    
    def preprocess_mnist(image, label):
        image = tf.cast(image, tf.float32) / 255.0
        return image, label
    
    # Load and preprocess
    train_ds = tfds.load('mnist', split='train', as_supervised=True)
    train_ds = train_ds.map(preprocess_mnist)
    train_ds = train_ds.shuffle(10000)
    train_ds = train_ds.batch(32)
    train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
    
    print("TFDS pipeline ready!")
    
except Exception as e:
    print(f"TFDS example skipped: {e}")

TFDS pipeline ready!


## 10. Conclusion

### Key Takeaways:

1. **tf.data API** provides efficient pipelines:
   - `from_tensor_slices()`: create a dataset from tensors
   - `map()`: apply transformations
   - `filter()`: filter elements
   - `batch()`: create batches
   - `shuffle()`: randomize order
   - `prefetch()`: overlap loading and training

2. **Optimal Pipeline Pattern**:
   ```python
   dataset = tf.data.Dataset.from_tensor_slices((X, y))
   dataset = dataset.shuffle(buffer_size)
   dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
   dataset = dataset.batch(batch_size)
   dataset = dataset.prefetch(tf.data.AUTOTUNE)
   ```

3. **TFRecord** for large datasets:
   - Efficient binary format
   - Supports compression
   - Well suited to distributed training

4. **Keras Preprocessing Layers**:
   - `Normalization`: Standardize numerical features
   - `StringLookup`: String to integer encoding
   - `CategoryEncoding`: One-hot/multi-hot encoding
   - `TextVectorization`: Text to integers
   - `Resizing`, `Rescaling`: Image preprocessing
   - `RandomFlip`, `RandomRotation`: Data augmentation

5. **TensorFlow Datasets (TFDS)**:
   - Library of ready-to-use datasets
   - Easy loading with `tfds.load()`
   - Built-in download and preparation



In [44]:
# Cleanup temporary files
import os

temp_files = [
    'sample_data.csv',
    'sample_data.tfrecord',
    'sample_data.tfrecord.gz',
    'data_part0.csv',
    'data_part1.csv',
    'data_part2.csv'
]

for f in temp_files:
    if os.path.exists(f):
        os.remove(f)

print("Temporary files cleaned up!")

Temporary files cleaned up!
