### Convolutional Layers

In [1]:
from sklearn.datasets import load_sample_images
import tensorflow as tf

images = load_sample_images()["images"]
images = tf.keras.layers.CenterCrop(height=70, width=120)(images)
images = tf.keras.layers.Rescaling(scale=1 / 255)(images)

In [3]:
images.shape

TensorShape([2, 70, 120, 3])

^ 4D tensor: two sample images × height × width × channels (red, green, blue)

In [4]:
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7)
fmaps = conv_layer(images)

In [5]:
fmaps.shape

TensorShape([2, 64, 114, 32])

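With the default `padding="valid"` and a stride of 1: OutputSize = InputSize − (KernelSize − 1), so 70 − 6 = 64 and 120 − 6 = 114.

^ height and width shrink by 6 because a 7 × 7 kernel with no padding loses 3 pixels on each side; the last dimension holds the 32 feature maps, one per filter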

In [6]:
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7, padding="same")
fmaps = conv_layer(images)
fmaps.shape

TensorShape([2, 70, 120, 32])
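The output sizes above can be reproduced with a small helper. This is a sketch under stated assumptions: `conv_output_size` is a hypothetical name (not a Keras function), and it assumes TensorFlow's padding rules, where `"valid"` keeps only the positions where the kernel fits entirely and `"same"` pads so the output size is the input size divided by the stride, rounded up.

```python
import math

def conv_output_size(input_size, kernel_size, strides=1, padding="valid"):
    """Output height or width of a conv layer (hypothetical helper, TF padding rules)."""
    if padding == "same":
        # zero-padding is added so that only the stride shrinks the output
        return math.ceil(input_size / strides)
    # "valid": no padding, the kernel must fit entirely inside the input
    return (input_size - kernel_size) // strides + 1

print(conv_output_size(70, 7), conv_output_size(120, 7))  # 64 114
print(conv_output_size(70, 7, padding="same"), conv_output_size(120, 7, padding="same"))  # 70 120
print(conv_output_size(70, 7, strides=2))  # 32
```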

In [8]:
kernels, biases = conv_layer.get_weights()

In [9]:
kernels.shape

(7, 7, 3, 32)

In [10]:
biases.shape

(32,)
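The kernel and bias shapes give the layer's parameter count directly. A quick arithmetic sketch (variable names are illustrative); the total should match what `conv_layer.count_params()` reports, and it does not depend on the 70 × 120 image size:

```python
kernel_h, kernel_w, in_channels, n_filters = 7, 7, 3, 32  # from kernels.shape above
n_weights = kernel_h * kernel_w * in_channels * n_filters  # one 7x7x3 kernel per filter
n_biases = n_filters  # one bias per feature map, matching biases.shape
n_params = n_weights + n_biases
print(n_weights, n_biases, n_params)  # 4704 32 4736
```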

### Memory

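Suppose a convolutional layer has 200 filters of size 5 × 5 (stride 1, `"same"` padding) and its input is a 150 × 100 RGB image. Each of the 200 feature maps contains 150 × 100 neurons, and each neuron computes a weighted sum of its 5 × 5 × 3 = 75 inputs: 200 × 150 × 100 × 75 = 225 million float multiplications.

Using 32-bit floats, the layer's output occupies 200 × 150 × 100 × 32 = 96 million bits (12 MB) of RAM for a single instance; for a batch of 100 instances, that is 1.2 GB of RAM.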
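The same arithmetic as a short script, assuming stride 1 and `"same"` padding so that each feature map is 150 × 100, and 32-bit floats; it also counts the layer's parameters (one bias per filter):

```python
n_filters, kernel_size, in_channels = 200, 5, 3
height, width = 150, 100
bits_per_float = 32

n_params = (kernel_size * kernel_size * in_channels + 1) * n_filters  # +1 for the bias
inputs_per_neuron = kernel_size * kernel_size * in_channels  # 75
n_multiplications = n_filters * height * width * inputs_per_neuron
output_bytes = n_filters * height * width * bits_per_float // 8  # one instance
batch_bytes = 100 * output_bytes  # a batch of 100 instances

print(f"{n_params:,} parameters")                        # 15,200 parameters
print(f"{n_multiplications:,} multiplications")          # 225,000,000 multiplications
print(f"{output_bytes / 1e6:.0f} MB per instance")        # 12 MB per instance
print(f"{batch_bytes / 1e9:.1f} GB for 100 instances")    # 1.2 GB for 100 instances
```

Note how small the parameter count is compared with the computation and memory costs: the expensive part is the activations, not the weights.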

### Pooling

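Pooling layers subsample (shrink) the input image to reduce the computational load, the memory usage, and the number of parameters.

Max pooling layers are now far more common than average pooling layers. Pooling can also be applied along the depth (channel) dimension instead of spatially; Keras does not include a depthwise max pooling layer, but we can implement one as a custom layer: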
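To make subsampling concrete, here is a NumPy sketch of 2 × 2 max pooling with stride 2 on a single-channel image. The function name `max_pool_2d` is hypothetical; it mimics the defaults of Keras's `MaxPool2D()` (pool size 2, stride 2, `"valid"` padding, which drops a leftover odd row or column):

```python
import numpy as np

def max_pool_2d(image, pool_size=2):
    """Non-overlapping max pooling of a 2D array (stride = pool_size, "valid" padding)."""
    h, w = image.shape
    h_out, w_out = h // pool_size, w // pool_size
    trimmed = image[:h_out * pool_size, :w_out * pool_size]  # drop leftover rows/cols
    # split into (h_out, pool, w_out, pool) blocks and keep each block's maximum
    blocks = trimmed.reshape(h_out, pool_size, w_out, pool_size)
    return blocks.max(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 7, 2],
                  [6, 2, 3, 4]])
pooled = max_pool_2d(image)
print(pooled.tolist())  # [[4, 5], [6, 7]]
```

Each output value keeps only the strongest activation of its 2 × 2 block, so the image has a quarter of the original number of values.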

In [11]:
class DepthPool(tf.keras.layers.Layer):
    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool_size = pool_size

    def call(self, inputs):
        shape = tf.shape(inputs)  # shape[-1] is the number of channels
        groups = shape[-1] // self.pool_size  # number of channel groups
        new_shape = tf.concat([shape[:-1], [groups, self.pool_size]], axis=0)
        # split the channels into groups of pool_size and keep each group's max
        return tf.reduce_max(tf.reshape(inputs, new_shape), axis=-1)
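A NumPy sketch of the same reshape-and-max idea (the function name `depth_max_pool` is hypothetical). It shows why the number of channels must be divisible by `pool_size`: the channel axis is split into `groups × pool_size`, and the max is taken over each group:

```python
import numpy as np

def depth_max_pool(inputs, pool_size=2):
    """Max over groups of pool_size consecutive channels (NumPy sketch)."""
    *batch_and_spatial, channels = inputs.shape
    if channels % pool_size != 0:
        raise ValueError("the number of channels must be divisible by pool_size")
    groups = channels // pool_size
    # (..., channels) -> (..., groups, pool_size), then keep each group's max
    return inputs.reshape(*batch_and_spatial, groups, pool_size).max(axis=-1)

x = np.arange(6, dtype=np.float32).reshape(1, 1, 1, 6)  # one pixel, 6 channels
out = depth_max_pool(x, pool_size=3)
print(out.shape, out.ravel().tolist())  # (1, 1, 1, 2) [2.0, 5.0]
```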

A CNN for the Fashion MNIST dataset:

In [14]:
from functools import partial

DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal")
model = tf.keras.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=64, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax")
])

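It is common practice to double the number of filters after each pooling layer: since a pooling layer divides each spatial dimension by a factor of 2, we can afford to double the number of feature maps in the next layer without exploding the number of parameters, the memory usage, or the computational load.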
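Tracing the feature map shapes through the model above shows the effect: each pooling layer divides the height and width by 2 (so the area by 4) while the next convolutions double the depth, so the number of activations per stage halves. A pure-Python sketch assuming `MaxPool2D()`'s defaults (pool size 2, stride 2, `"valid"` padding); `trace_shapes` is a hypothetical helper:

```python
def trace_shapes(height=28, width=28):
    """Trace (height, width, channels) through the Fashion MNIST CNN above."""
    shapes = []
    h, w = height, width
    for filters in (64, 128, 256):
        # the "same"-padded convolutions keep h and w; only the depth changes
        shapes.append((h, w, filters))
        # MaxPool2D() with pool size 2 and stride 2 halves h and w (rounding down)
        h, w = h // 2, w // 2
    shapes.append((h, w, 256))  # after the last pooling layer, before Flatten
    return shapes

for h, w, c in trace_shapes():
    print((h, w, c), "activations:", h * w * c)
# (28, 28, 64) activations: 50176
# (14, 14, 128) activations: 25088
# (7, 7, 256) activations: 12544
# (3, 3, 256) activations: 2304
```

The last line is also the size of the `Flatten` output that feeds the first `Dense` layer.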

Data augmentation is also useful when you have an unbalanced dataset: you can use it to generate more samples of the less frequent classes. This is called the synthetic minority oversampling technique (SMOTE).
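Strictly speaking, SMOTE as originally proposed does not transform images: it creates synthetic minority-class samples by interpolating, in feature space, between a minority sample and one of its nearest minority-class neighbors. A minimal NumPy sketch of that idea (the `smote` function here is illustrative, not a library API; the imbalanced-learn package provides a full implementation):

```python
import numpy as np

def smote(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic samples by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))    # random minority sample
        j = neighbors[i, rng.integers(k)]  # one of its nearest neighbors
        lam = rng.random()                 # random point on the segment between them
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote(minority, n_new=10, k=2)
print(synthetic.shape)  # (10, 2)
```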

### ResNet-34 CNN

In [18]:
DefaultConv2D = partial(
    tf.keras.layers.Conv2D,
    kernel_size=3,
    strides=1,
    padding="same",
    kernel_initializer="he_normal",
    use_bias=False
)

class ResidualUnit(tf.keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = tf.keras.activations.get(activation)
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            tf.keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            tf.keras.layers.BatchNormalization()
        ]
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                tf.keras.layers.BatchNormalization()
            ]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

In [19]:
model = tf.keras.Sequential([
    DefaultConv2D(64, kernel_size=7, strides=2, input_shape=[224, 224, 3]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
])
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters

model.add(tf.keras.layers.GlobalAvgPool2D())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10, activation="softmax"))
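The "34" in the name counts the weighted layers on the main path: the 7 × 7 stem convolution, two 3 × 3 convolutions in each of the 16 residual units, and the final dense layer (the 1 × 1 convolutions on the skip connections are conventionally not counted). A pure-Python sketch that recomputes this, along with the strides chosen by the loop above and the resulting feature map size for a 224 × 224 input:

```python
import math

# same filter schedule and stride rule as the loop above
unit_filters = [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3
prev_filters = 64
unit_strides = []
for filters in unit_filters:
    unit_strides.append(1 if filters == prev_filters else 2)
    prev_filters = filters

# weighted layers on the main path: stem conv + 2 convs per unit + final dense
n_layers = 1 + 2 * len(unit_filters) + 1

# spatial size: 224 -> stem conv (stride 2) -> max pool (stride 2) -> residual units
size = math.ceil(224 / 2)   # 7x7 conv, stride 2, "same" padding
size = math.ceil(size / 2)  # 3x3 max pool, stride 2, "same" padding
for s in unit_strides:
    size = math.ceil(size / s)  # each unit's first conv uses "same" padding

print(n_layers, unit_strides.count(2), size)  # 34 3 7
```

So the `GlobalAvgPool2D` layer receives 7 × 7 × 512 feature maps.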

Of course, we rarely need to build these models by hand: Keras comes with pretrained models in `tf.keras.applications`. Just make sure your images match the input size the model expects.

In [23]:
# make sure your images are 224 × 224
model = tf.keras.applications.ResNet50(weights="imagenet")



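Resize the images to 224 × 224, first cropping them to the target aspect ratio (`crop_to_aspect_ratio=True`) so they are not distorted: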

In [24]:
images = load_sample_images()["images"]
images_resized = tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True)(images)

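Each pretrained model in `tf.keras.applications` comes with its own `preprocess_input()` function, because models expect different input scales: some expect values from 0 to 1, others from −1 to 1. These functions assume the original pixel values range from 0 to 255, which is the case here (the images were reloaded and not rescaled).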

In [25]:
inputs = tf.keras.applications.resnet50.preprocess_input(images_resized)
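To see why each model needs its own function, here is a NumPy sketch of two preprocessing styles. It assumes that `resnet50.preprocess_input` uses the "caffe" style (RGB to BGR, then subtract the per-channel ImageNet means below, no scaling) and that `xception.preprocess_input` uses the "tf" style (scale to −1 to 1); the function names are hypothetical:

```python
import numpy as np

# assumed ImageNet channel means for the "caffe" style, in B, G, R order
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68])

def caffe_style(x):
    """RGB in [0, 255] -> BGR with the per-channel ImageNet means subtracted."""
    x = np.asarray(x, dtype=np.float64)[..., ::-1]  # reverse the channel axis: RGB -> BGR
    return x - IMAGENET_BGR_MEAN

def tf_style(x):
    """[0, 255] -> [-1, 1]."""
    return np.asarray(x, dtype=np.float64) / 127.5 - 1.0

red_pixel = np.array([255.0, 0.0, 0.0])  # pure red, RGB
print(caffe_style(red_pixel))  # about [-103.939, -116.779, 131.32]
print(tf_style(red_pixel))     # [1, -1, -1]
```

Feeding a model inputs preprocessed for another model would not raise an error, but predictions would silently degrade.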

In [26]:
y_proba = model.predict(inputs)
y_proba.shape

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 740ms/step


(2, 1000)

In [28]:
top_K = tf.keras.applications.resnet50.decode_predictions(y_proba, top=3)
for image_index in range(len(images)):
    print(f"Image #{image_index}")
    for class_id, name, proba in top_K[image_index]:  # don't overwrite y_proba
        print(f"  {class_id} - {name:12s} {proba:.2%}")

Image #0
  n03598930 - jigsaw_puzzle 30.68%
  n02782093 - balloon      17.17%
  n03888257 - parachute    5.57%
Image #1
  n04209133 - shower_cap   34.37%
  n09229709 - bubble       11.41%
  n02782093 - balloon      9.46%
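At its core, `decode_predictions()` picks the `top` most likely classes for each image and then maps their indices to ImageNet class IDs and names (using the class index file it downloads). A NumPy sketch of the top-k step only; `top_k` is a hypothetical helper and the probabilities are made up:

```python
import numpy as np

def top_k(y_proba, k=3):
    """Return (indices, probabilities) of the k most likely classes per row."""
    indices = np.argsort(y_proba, axis=1)[:, ::-1][:, :k]  # highest probability first
    return indices, np.take_along_axis(y_proba, indices, axis=1)

y_proba = np.array([[0.1, 0.6, 0.3],
                    [0.5, 0.2, 0.3]])
indices, probas = top_k(y_proba, k=2)
print(indices.tolist())  # [[1, 2], [0, 2]]
print(probas.tolist())   # [[0.6, 0.3], [0.5, 0.3]]
```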


### Pretrained models for transfer learning

If you don't have enough data to train a model from scratch, you can reuse the lower layers of a pretrained model. Example with the flowers dataset (`tf_flowers`):

In [35]:
import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
dataset_size = info.splits["train"].num_examples
class_names = info.features["label"].names
n_classes = info.features["label"].num_classes

In [36]:
test_set_raw, valid_set_raw, train_set_raw = tfds.load(
    "tf_flowers",
    split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
    as_supervised=True
)

In [37]:
batch_size = 32
preprocess = tf.keras.Sequential([
    tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True),
    tf.keras.layers.Lambda(tf.keras.applications.xception.preprocess_input)
])

train_set = train_set_raw.map(lambda X, y: (preprocess(X), y))
train_set = train_set.shuffle(1000, seed=42).batch(batch_size).prefetch(1)
valid_set = valid_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)
test_set = test_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)

The dataset is fairly small, so a bit of data augmentation should help:

In [38]:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip(mode="horizontal", seed=42),
    tf.keras.layers.RandomRotation(factor=0.05, seed=42),
    tf.keras.layers.RandomContrast(factor=0.2, seed=42)
])
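A NumPy sketch of what two of these layers do to a single (height, width, channels) image. It assumes the documented `RandomContrast` behavior, where each channel becomes (x − mean) × factor + mean with the factor drawn from [1 − 0.2, 1 + 0.2]; rotation is left out, and the helper names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def random_flip_horizontal(image, rng):
    """Flip an (H, W, C) image left-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image

def adjust_contrast(image, factor):
    """Per channel: (x - mean) * factor + mean, so each channel's mean is unchanged."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    return (image - mean) * factor + mean

def random_contrast(image, rng, max_delta=0.2):
    factor = rng.uniform(1 - max_delta, 1 + max_delta)  # like factor=0.2
    return adjust_contrast(image, factor)

image = rng.random((4, 6, 3))  # a tiny random RGB image
augmented = random_contrast(random_flip_horizontal(image, rng), rng)
print(augmented.shape)  # (4, 6, 3)
```

In Keras, these random layers are only active during training; at inference time they pass their inputs through unchanged.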

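Load an Xception model pretrained on ImageNet. Note the `include_top=False`: it excludes the top of the network (the global average pooling layer and the dense output layer), so we add our own pooling and output layers for the flower classes. Also note that the model below does not use the `data_augmentation` model defined above; to benefit from it, it would have to be applied to the training inputs before the base model.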

In [39]:
base_model = tf.keras.applications.xception.Xception(weights="imagenet", include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
model = tf.keras.Model(inputs=base_model.input, outputs=output)

In [41]:
for layer in base_model.layers:
    layer.trainable = False

In [42]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=3)

Epoch 1/3
86/86 ━━━━━━━━━━━━━━━━━━━━ 74s 849ms/step - accuracy: 0.7992 - loss: 1.0871 - val_accuracy: 0.7641 - val_loss: 1.1317
Epoch 2/3
86/86 ━━━━━━━━━━━━━━━━━━━━ 70s 816ms/step - accuracy: 0.8809 - loss: 0.5196 - val_accuracy: 0.8022 - val_loss: 0.8470
Epoch 3/3
86/86 ━━━━━━━━━━━━━━━━━━━━ 72s 835ms/step - accuracy: 0.8954 - loss: 0.4275 - val_accuracy: 0.8512 - val_loss: 0.7357


The new top layers are now reasonably well trained (about 85% validation accuracy), so let's unfreeze the base model's layers from index 56 onward:

In [43]:
for layer in base_model.layers[56:]:
    layer.trainable = True

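Always recompile the model after freezing or unfreezing layers. This time we use a much lower learning rate to avoid damaging the pretrained weights: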

In [46]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=10)

Epoch 1/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 178s 2s/step - accuracy: 0.8921 - loss: 0.3169 - val_accuracy: 0.8730 - val_loss: 0.5412
Epoch 2/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 174s 2s/step - accuracy: 0.9677 - loss: 0.1009 - val_accuracy: 0.9220 - val_loss: 0.2717
Epoch 3/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 173s 2s/step - accuracy: 0.9921 - loss: 0.0217 - val_accuracy: 0.9056 - val_loss: 0.3381
Epoch 4/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 175s 2s/step - accuracy: 0.9931 - loss: 0.0239 - val_accuracy: 0.9201 - val_loss: 0.3091
Epoch 5/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 174s 2s/step - accuracy: 0.9959 - loss: 0.0160 - val_accuracy: 0.9147 - val_loss: 0.3266
Epoch 6/10
86/86 ━━━━━━━━━━━━━━━━━━━━ 174s 2s/step - accuracy: 0.9954 - loss: 0.0160 - val_accuracy: 0.9074 - val_loss: 0.3712
Epoch 7/10
86/86 ━━━━

Localizing an object (finding where it is in the image) can be expressed as a regression task: predict a bounding box around the object, typically the horizontal and vertical coordinates of its center plus its width and height. That is four numbers to predict.

In [48]:
base_model = tf.keras.applications.xception.Xception(weights="imagenet", include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = tf.keras.layers.Dense(4)(avg)
model = tf.keras.Model(inputs=base_model.input, outputs=[class_output, loc_output])

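optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # fresh optimizer: one built for the previous model can't be reused
model.compile(
    loss=["sparse_categorical_crossentropy", "mse"],
    loss_weights=[0.8, 0.2],  # depends on what you care most about
    optimizer=optimizer,
    metrics=[["accuracy"], ["mse"]]  # one list of metrics per output
)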
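With two outputs, Keras minimizes the weighted sum of the per-output losses, here 0.8 × cross-entropy + 0.2 × MSE. A NumPy sketch of that arithmetic on a made-up batch of two images (all values and helper names are illustrative):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_proba):
    # mean negative log-probability assigned to each sample's true class
    return -np.mean(np.log(y_proba[np.arange(len(y_true)), y_true]))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# illustrative batch of 2 images: class labels and 4-number bounding boxes
y_true_class = np.array([0, 2])
y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])
y_true_box = np.array([[0.50, 0.50, 0.40, 0.30],
                       [0.20, 0.60, 0.50, 0.50]])
y_pred_box = np.array([[0.45, 0.50, 0.40, 0.35],
                       [0.25, 0.60, 0.40, 0.50]])

class_loss = sparse_categorical_crossentropy(y_true_class, y_proba)
box_loss = mean_squared_error(y_true_box, y_pred_box)
total_loss = 0.8 * class_loss + 0.2 * box_loss  # loss_weights=[0.8, 0.2]
print(class_loss, box_loss, total_loss)
```

Raising the second weight makes the model care more about accurate boxes at the expense of classification.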

The flowers dataset does not have bounding boxes around the flowers, so we need to add them ourselves. Getting the labels is often one of the hardest and most costly parts of a machine learning project.

Consider an image labeling tool such as VGG Image Annotator, or a crowdsourcing platform such as Amazon Mechanical Turk.

Once labeled, each item of the dataset should be a tuple of the form `(images, (class_labels, bounding_boxes))`, matching the model's two outputs.

It's better to predict the square root of the width and height rather than the width and height directly: that way, with an MSE loss, a 10-pixel error on a large bounding box is penalized less than a 10-pixel error on a small bounding box.
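A NumPy sketch of encoding box targets this way. It assumes boxes are given in pixels as (x_center, y_center, width, height) and normalized by the image size so every value lies between 0 to 1 (a common choice, not something the model requires); `encode_box` and `decode_box` are hypothetical helpers:

```python
import numpy as np

def encode_box(box_px, image_height, image_width):
    """Pixel (x_center, y_center, width, height) -> normalized target with sqrt sizes."""
    x, y, w, h = box_px
    return np.array([x / image_width, y / image_height,
                     np.sqrt(w / image_width), np.sqrt(h / image_height)])

def decode_box(target, image_height, image_width):
    """Inverse of encode_box: square the predicted sizes and rescale to pixels."""
    x, y, sqrt_w, sqrt_h = target
    return np.array([x * image_width, y * image_height,
                     sqrt_w ** 2 * image_width, sqrt_h ** 2 * image_height])

# the same 10-pixel width error on a large and on a small box, in a 224 x 224 image
errors = {}
for true_w in (200, 20):
    target = encode_box((112, 112, true_w, true_w), 224, 224)
    predicted = encode_box((112, 112, true_w + 10, true_w), 224, 224)
    errors[true_w] = (target[2] - predicted[2]) ** 2  # squared error on sqrt(width)
print(errors)  # the small box's error is roughly 8 times larger
```

Without the square root, both errors would be identical, (10 / 224)², whatever the box size.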

### Intersection over union (IoU)

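IoU measures how well a predicted bounding box matches the target bounding box: it is the area of their overlap divided by the area of their union (1 for a perfect match, 0 when the boxes don't overlap).
Keras provides `tf.keras.metrics.MeanIoU`, but it computes IoU between predicted and true class labels (for example, per-pixel labels in semantic segmentation), averaged over classes; it does not take box coordinates, so for bounding boxes the IoU is typically computed directly from the coordinates.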
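A minimal pure-Python IoU for two axis-aligned boxes. It assumes corner format (x_min, y_min, x_max, y_max); the hypothetical `center_to_corners` helper converts from the (x_center, y_center, width, height) format predicted earlier:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # corners of the intersection rectangle
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    intersection = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)  # 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def center_to_corners(x, y, w, h):
    """Convert (x_center, y_center, width, height) to (x_min, y_min, x_max, y_max)."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```

The two 2 × 2 boxes overlap on a 1 × 1 square, so the intersection is 1 and the union is 4 + 4 − 1 = 7.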

### Mean Average Precision (mAP)

Mean Average Precision (mAP) is a key metric for evaluating object detection models. It builds on precision and recall, which measure how well a model identifies objects correctly.

Average Precision (AP): instead of measuring precision at a single recall level, we compute the maximum precision achievable with at least 0% recall, then at least 10% recall, 20%, and so on up to 100%, and then average these maximum precisions.

Mean Average Precision (mAP): When dealing with multiple object classes, we calculate AP for each class, then take the mean of all these AP values.

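Bounding Box Accuracy (IoU Thresholds): in object detection, a prediction must be both correct in class and correctly localized. We use Intersection over Union (IoU) for the localization part: a prediction counts as correct only if its class is right and its IoU with the target box is greater than a threshold, such as 0.5. The resulting metric is written mAP@0.5 (or mAP@50%).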
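A NumPy sketch of the AP and mAP computation described above, assuming we already have (precision, recall) points for each class (in practice they come from ranking the detections by confidence and matching them to targets with an IoU threshold). The function names and data are illustrative:

```python
import numpy as np

def interpolated_ap(precisions, recalls, n_points=11):
    """Average of the max precision with recall >= r, for r = 0.0, 0.1, ..., 1.0."""
    precisions = np.asarray(precisions, dtype=float)
    recalls = np.asarray(recalls, dtype=float)
    max_precisions = []
    for i in range(n_points):
        r = i / (n_points - 1)  # recall thresholds 0.0, 0.1, ..., 1.0
        reachable = recalls >= r
        # max precision among points with at least this recall (0 if none)
        max_precisions.append(precisions[reachable].max() if reachable.any() else 0.0)
    return float(np.mean(max_precisions))

def mean_average_precision(curves_per_class):
    """mAP: mean of the per-class APs; each curve is (precisions, recalls)."""
    return float(np.mean([interpolated_ap(p, r) for p, r in curves_per_class]))

# illustrative (precisions, recalls) points for two classes
class_a = ([1.0, 0.8, 0.6], [0.2, 0.4, 0.5])
class_b = ([0.9, 0.7, 0.5], [0.3, 0.7, 1.0])
print(interpolated_ap(*class_a), interpolated_ap(*class_b))
print(mean_average_precision([class_a, class_b]))
```

Class A never reaches more than 50% recall, so the thresholds from 60% to 100% contribute a precision of 0 and pull its AP down.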

CNNs are a vast and fast-moving field, so there is much more to explore later: video classification, segmentation, predicting the next frame of a video, and combining text and images.

https://www.tensorflow.org/hub/tutorials/tf2_object_detection