<a href="https://colab.research.google.com/github/sweaterr/1_CODE/blob/master/TF2_Image_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
! pip install tensorflow==2.1.0



In [15]:
import tensorflow as tf
tf.__version__

'2.1.0'

# Autograph

Automatically converting Python code into its graphical representation is done with the use of AutoGraph. In TensorFlow 2.0, AutoGraph is automatically applied to a function when it is decorated with @tf.function. This decorator creates callable graphs from Python functions.

A function, once decorated correctly, is processed by tf.function and the tf.autograph module in order to convert it into its graphical representation. The following diagram shows a schematic representation of what happens when a decorated function is called:

![](https://learning.oreilly.com/library/view/hands-on-neural-networks/9781789615555/assets/ba5dc45f-0079-4085-b76d-2f6fb034fae8.png)

Schematic representation of what happens when a function, f, decorated with @tf.function, which is called on the first call and on any other subsequent call
On the first call of the annotated function, the following occurs:

1. The function is executed and traced. Eager execution is disabled in this context, and so every tf.* method defines a tf.Operation node that produces a tf.Tensor output, exactly like it does in TensorFlow 1.x.
1. The tf.autograph module is used to detect Python constructs that can be converted into their graph equivalent. The graph representation is built from the function trace and AutoGraph information. This is done in order to preserve the execution order that's defined in Python.
1. The tf.Graph object has now been built.
1. Based on the function name and the input parameters, a unique ID is created and associated with the graph. The graph is then cached into a map so that it can be reused when a second invocation occurs and the ID matches.

Converting a function into its graph representation usually requires us to think; in TensorFlow 1.x, not every function that works in eager mode can be converted painlessly into its graph version.

For instance, a variable in eager mode is a Python object that follows the Python rules regarding its scope. In graph mode, as we found out in the previous chapter, a variable is a persistent object that will continue to exist, even if its associated Python variable goes out of scope and is garbage-collected.

Therefore, special attention has to be placed on software design: if a function has to be graph-accelerated and it creates a status (using tf.Variable and similar objects), it is up to the developer to take care of avoiding having to recreate the variable every time the function is called.

For this reason, tf.function parses the function body multiple times while looking for the tf.Variable definition. If, at the second invocation, it finds out that a variable object is being recreated, it raises an exception:

```
ValueError: tf.function-decorated function tried to create variables on non-first call.
```

In practice, if we have defined a function that performs a simple operation that uses a tf.Variable inside it, we have to ensure that the object is only created once.

The following function works correctly in eager mode, but it fails to execute if it is decorated with @tf.function and is raising the preceding exception:

In [0]:
def f():
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    b = tf.Variable(12.)
    y = tf.matmul(a, x) + b
    return y

Handling functions that create a state means that we have to rethink our usage of graph-mode. A state is a persistent object, such as a variable, and the variable can't be redeclared more than once. Due to this, the function definition can be changed in two ways:

By passing the variable as an input parameter
By breaking the function scope and inheriting a variable from the external scope
The first option requires changing the function definition that's making it:



In [15]:
@tf.function
def f(b):
    a = tf.constant([[10,10],[11.,1.]])
    x = tf.constant([[1.,0.],[0.,1.]])
    y = tf.matmul(a, x) + b
    return y

var = tf.Variable(12.)
print(f(var))
print(f(15))
print(f(tf.constant(1, tf.float32)))

tf.Tensor(
[[22. 22.]
 [23. 13.]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[25. 25.]
 [26. 16.]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[11. 11.]
 [12.  2.]], shape=(2, 2), dtype=float32)


f now accepts a Python input variable, b. This variable can be a tf.Variable, a tf.Tensor, and also a NumPy object or a Python type. Every time the input type changes, a new graph is created in order to make an accelerated version of the function that works for any required input type (this is required because of how a TensorFlow graph is statically typed).

The second option, on the other hand, requires breaking down the function scope, making the variable available outside the scope of the function itself. In this case, there are two paths we can follow:

* Not recommended: Use global variables
* Recommended: Use Keras-like objects
The first path, which is not recommended, consists of declaring the variable outside the function body and using it inside, ensuring that it will only be declared once:

In [16]:
b = None

@tf.function
def f():
    a = tf.constant([[10, 10], [11., 1.]])
    x = tf.constant([[1., 0.], [0., 1.]])
    global b
    if b is None:
        b = tf.Variable(12.)
    y = tf.matmul(a, x) + b
    return y

f()

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[22., 22.],
       [23., 13.]], dtype=float32)>

The second path, which is recommended, is to use an object-oriented approach and declare the variable as a private attribute of a class. Then, you need to make the objects that were instantiated callable by putting the function body inside the __call__ method:

In [17]:
class F():
    def __init__(self):
        self._b = None

    @tf.function
    def __call__(self):
        a = tf.constant([[10, 10], [11., 1.]])
        x = tf.constant([[1., 0.], [0., 1.]])
        if self._b is None:
            self._b = tf.Variable(12.)
        y = tf.matmul(a, x) + self._b
        return y

f = F()
f()

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[22., 22.],
       [23., 13.]], dtype=float32)>

AutoGraph and the graph acceleration process work best when it comes to optimizing the training process.

In fact, the most computationally-intensive part of the training is the forward pass, followed by gradient computation and parameter updates. In the previous example, following the new structure that the absence of tf.Session allows us to follow, we separate the training step from the training loop. The training step is a function without a state that uses variables inherited from the outer scope. Therefore, it can be converted into its graph representation and accelerated just by decorating it with the @tf.function decorator:

In [0]:
@tf.function
def train_step(inputs, labels):
  # function body
  pass

You are invited to measure the speedup that was introduced by the graph conversion of the train_step function.

The speedup is not guaranteed since eager execution is already fast and there are simple scenarios in which eager execution is as fast as its graphical counterpart. However, the performance boost is visible when the models become more complex and deeper.
AutoGraph automatically converts Python constructs into their tf.* equivalent, but since converting source code that preserves semantics is not an easy task, there are scenarios in which it is better to help AutoGraph perform source code transformation.

In fact, there are constructs that work in eager execution that are already drop-in replacements for Python constructs. In particular, tf.range replaces range, tf.print replaces print, and tf.assert replaces assert.

For instance, AutoGraph is not able to automatically convert print into tf.print in order to preserve its semantic. Therefore, if we want a graph-accelerated function to print something when executed in graph mode, we have to write the function using tf.print instead of print.

You are invited to define simple functions that use tf.range instead of  range and print instead of tf.print, and then visualize how the source code is converted using the tf.autograph module.

For instance, take a look at the following code:

In [18]:
import tensorflow as tf

@tf.function
def f():
    x = 0
    for i in range(10):
        print(i)
        x += i
    return x


f()
print(tf.autograph.to_code(f.python_function))

0
1
2
3
4
5
6
7
8
9
def tf__f_1():
  do_return = False
  retval_ = ag__.UndefinedReturnValue()
  with ag__.FunctionScope('f', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
    x = 0

    def get_state():
      return ()

    def set_state(_):
      pass

    def loop_body(iterates, x):
      i = iterates
      print(i)
      x += i
      return x,
    x, = ag__.for_stmt(ag__.converted_call(range, (10,), None, fscope), None, loop_body, get_state, set_state, (x,), ('x',), ())
    do_return = True
    retval_ = fscope.mark_return_value(x)
  do_return,
  return ag__.retval(retval_)



This produces 0,1,2, ..., 10 when f is called—does this happens every time f is invoked, or only the first time?

You are invited to carefully read through the following AutoGraph-generated function (this is machine-generated, and so it is hard to read) in order to understand why f behaves in this way:

In [0]:
def tf__f():
  try:
    with ag__.function_scope('f'):
      do_return = False
      retval_ = None
      x = 0

      def loop_body(loop_vars, x_1):
        with ag__.function_scope('loop_body'):
          i = loop_vars
          with ag__.utils.control_dependency_on_returns(ag__.print_(i)):
            x, i_1 = ag__.utils.alias_tensors(x_1, i)
            x += i_1
            return x,
      x, = ag__.for_stmt(ag__.range_(10), None, loop_body, (x,))
      do_return = True
      retval_ = x
      return retval_
  except:
    ag__.rewrite_graph_construction_error(ag_source_map__)

Migrating an old codebase from Tensorfow 1.x to 2.0 can be a time-consuming process. This is why the TensorFlow authors created a conversion tool that allows us to automatically migrate the source code (it even works on Python notebooks!).

#  The tf.data.Dataset object


In [0]:
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices({
    "a":
    tf.random.uniform([4]),
    "b":
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)
})
for value in dataset:
    print(value["a"])

tf.Tensor(0.38550448, shape=(), dtype=float32)
tf.Tensor(0.07681441, shape=(), dtype=float32)
tf.Tensor(0.63705075, shape=(), dtype=float32)
tf.Tensor(0.65296555, shape=(), dtype=float32)


In [0]:
def noise():
    while True:
        yield tf.random.uniform((100,))


dataset = tf.data.Dataset.from_generator(noise, (tf.float32))


0
(32, 100)
1
(32, 100)


The only peculiarity of the from_generator method is the need to pass the type of the parameters (tf.float32, in this case) as the second parameter; this is required since to build a graph we need to know the type of the parameters in advance.

Using method chaining, it is possible to create new dataset objects, transforming the one just built to get the data our machine learning model expects as input. For example, if we want to sum 10 to every component of the noise vector, shuffle the dataset content, and create batches of 32 vectors each, we can do so by calling just three methods:

In [0]:
buffer_size = 10
batch_size = 32
dataset = dataset.map(lambda x: x + 10).shuffle(buffer_size).batch(batch_size)
for idx, noise in enumerate(dataset):
    if idx == 2:
        break
    print(idx)
    print(noise.shape)

0
(32, 32, 100)
1
(32, 32, 100)


The map method is the most widely used method of the `tf.data.Dataset` object since it allows us to apply a function to every element of the input dataset, producing a new, transformed dataset.

The `shuffle` method is used in every training pipeline since this transformation randomly shuffles the input dataset using a fixed-sized buffer; this means that the shuffled data first fetches the buffer_size element from its input, then shuffles them and produces the output.

The `batch` method gathers the `batch_size` elements from its input and creates a batch as output. The only constraint of this transformation is that all elements of the batch must have the same shape.

To train a model, it has to be fed with all the elements of the training set for multiple epochs. The `tf.data.Dataset` class offers the `repeat(num_epochs)` method to do this. Thus, the input data pipeline can be summarized as shown in the following diagram:

![](https://learning.oreilly.com/library/view/hands-on-neural-networks/9781789615555/assets/640c3fdc-3c31-47b8-8380-1f321542659d.png)

The diagram shows the typical data input pipeline: the transformation from raw data to data ready to be used by the model, just by chaining method calls. 

Prefetching and caching are optimization tips that are explained in the next section.

Please note that until not a single word has been said about the concept of thread, synchronization, or remote filesystems.

All this is hidden by the tf.data API:

* The input paths (for example, when using the `tf.data.Dataset.list_files` method) can be remote. TensorFlow internally uses the `tf.io.gfile` package, which is a file input/output wrapper without thread locking. This module makes it possible to read from a local filesystem or a remote filesystem in the same way. For instance, it is possible to read from a Google Cloud Storage bucket by using its address in the `gs://bucket/` format, without the need to worry about authentication, remote requests, and all the boilerplate required to work with a remote filesystem.
* Every transformation applied to the data is executed using all the CPU resources efficiently—a number of threads equal to the number of CPU cores are created together with the dataset object and are used to process the data sequentially and in parallel whenever parallel transformation is possible.
* The synchronization among these threads is all managed by the `tf.data` API.

All the transformations described by chaining method calls are executed by threads on the CPU that `tf.data.Dataset` instantiates to perform operations that can be executed in parallel automatically, which is a great performance boost.

Furthermore, `tf.data.Dataset` is high-level enough to make invisible all the threads execution and synchronization, but the automated solution can be suboptimal: the target device could be not completely used, and it is up to the user to remove the bottlenecks to reach the 100% usage of the target devices.

# Performance optimizations

The `tf.data` API as shown so far describes a sequential data input pipeline that transforms the data from a raw to a useful format by applying transformations.

All these operations are executed on the CPU while the target device (CPUs, TPUs, or, in general, the consumer) waits for the data. If the target device consumes the data faster than it is produced, there will be moments of 0% utilization of the target devices.

In parallel programming, this problem has been solved by using prefetching.

### Prefetching
When the consumer is working, the producer shouldn't be idle but must work in the background to produce the data the consumer will need in the next iteration.

The `tf.data` API offers the prefetch(n) method to apply a transformation that allows overlapping the work of the producer and the consumer. The best practice is adding prefetch(n) at the end of the input pipeline to overlap the transformation performed on the CPU with the computation done on the target.

Choosing n is easy: n is the number of elements consumed by a training step, and since the vast majority of models are trained using batches of data, one batch per training step, then n=1.

The process of reading from disks, especially if reading big files, reading from slow HDDs, or using remote filesystems can be time-consuming. Caching is often used to reduce this overhead.

### Cache elements

The cache transformation can be used to cache the data in memory, completely removing the accesses to the data sources. This can bring huge benefits when using remote filesystems, or when the reading process is slow. Caching data after the first epoch is only possible if the data can fit into memory.

The cache method acts as a barrier in the transformation pipeline: everything executed before the cache method is executed only once, thus placing this transformation in the pipeline can bring immense benefits. In fact, it can be applied after a computationally intensive transformation or after any slow process to speed up everything that comes next.


# Building your dataset

The following example shows how to build a `tf.data.Dataset` object using the Fashion-MNIST dataset. This is the first complete example of a dataset that uses all the best practices described previously; please take the time to understand why the method chaining is performed in this way and where the performance optimizations have been applied.

In the following code, we define the `train_dataset` function, which returns the `tf.data.Dataset` object ready to use:


In [0]:
import tensorflow as tf 
from tensorflow.keras.datasets import fashion_mnist 
 
 
def train_dataset(batch_size=32, num_epochs=1): 
    (train_x, train_y), (test_x, test_y) = fashion_mnist.load_data()
    input_x, input_y = train_x, train_y 

    def scale_fn(image, label): 
        return (tf.image.convert_image_dtype(image, tf.float32) - 0.5) * 2.0, label 
 
    dataset = tf.data.Dataset.from_tensor_slices( 
        (tf.expand_dims(input_x, -1), tf.expand_dims(input_y, -1)) 
    ).map(scale_fn) 
 
    dataset = dataset.cache().repeat(num_epochs)
    dataset = dataset.shuffle(batch_size)
 
    return dataset.batch(batch_size).prefetch(1)

A training dataset, however, should contain augmented data in order to address the overfitting problem. Applying data augmentation on image data is straightforward using the TensorFlow `tf.image` package.

# Data augmentation
The ETL process defined so far only transforms the raw data, applying transformations that do not change the image content. Data augmentation, instead, requires to apply meaningful transformation the raw data with the aim of creating a bigger dataset and train, thus, a model more robust to these kinds of variations.

Working with images, it is possible to use the whole API offered by the tf.image package to augment the dataset. The augmentation step consists in the definition of a function and its application to the training set, using the dataset map method.

The set of valid transformations depends on the dataset—if we were using the MNIST dataset, for instance, flipping the input image upside down won't be a good idea (nobody wants to feed an image of the number 6 labeled as 9), but since we are using the fashion-MNIST dataset we can flip and rotate the input image as we like (a pair of trousers remains a pair of trousers, even if randomly flipped or rotated).

The tf.image package already contains functions with stochastic behavior, designed for data augmentation. These functions apply the transformation to the input image with a 50% chance; this is the desired behavior since we want to feed the model with both original and augmented images. Thus, a function that applies meaningful transformations to the input data can be defined as follows:

In [0]:
def augment(image):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image

In [24]:
import tensorflow_datasets as tfds

# See available datasets
print(tfds.list_builders())
# Construct 2 tf.data.Dataset objects
# The training dataset and the test dataset
ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"])
builder = tfds.builder("mnist")
print(builder.info)

['abstract_reasoning', 'aeslc', 'aflw2k3d', 'amazon_us_reviews', 'bair_robot_pushing_small', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'c4', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'chexpert', 'cifar10', 'cifar100', 'cifar10_1', 'cifar10_corrupted', 'citrus_leaves', 'clevr', 'cmaterdb', 'cnn_dailymail', 'coco', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'curated_breast_imaging_ddsm', 'cycle_gan', 'deep_weeds', 'definite_pronoun_resolution', 'diabetic_retinopathy_detection', 'dmlab', 'downsampled_imagenet', 'dsprites', 'dtd', 'duke_ultrasound', 'dummy_dataset_shared_generator', 'dummy_mnist', 'emnist', 'esnli', 'eurosat', 'fashion_mnist', 'flores', 'food101', 'gap', 'gigaword', 'glue', 'groove', 'higgs', 'horses_or_humans', 'i_naturalist2017', 'image_label_folder', 'imagenet2012', 'imagenet2012_corrupted', 'imagenet_resized', 'imdb_reviews', 'iris'

local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead set
data_dir=gs://tfds-data/datasets.



HBox(children=(IntProgress(value=0, description='Dl Completed...', max=19, style=ProgressStyle(description_wid…



[1mDataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/1.0.0. Subsequent calls will reuse this data.[0m
tfds.core.DatasetInfo(
    name='mnist',
    version=1.0.0,
    description='The MNIST database of handwritten digits.',
    homepage='http://yann.lecun.com/exdb/mnist/',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
    redistribution_info=,
)



# Eager integration

The tf.data.Dataset object is iterable, which means one can either enumerate its elements using a for loop or create a Python iterator using the iter keyword. Please note that being iterable does not imply being a Python iterator as pointed out at the beginning of this chapter.

Iterating over a dataset object is extremely easy: we can use the standard Python for loop to extract a batch at each iteration.

Configuring the input pipeline by using a dataset object is a better solution than the one used so far.

The manual process of extracting elements from a dataset by computing the indices is error-prone and inefficient, while the tf.data.Dataset objects are highly-optimized. Moreover, the dataset objects are fully compatible with tf.function, and therefore the whole training loop can be graph-converted and accelerated.

Furthermore, the lines of code get reduced a lot, increasing the readability. The following code block represents the graph-accelerated (via @tf.function) custom training loop from the previous chapter, Chapter 4, TensorFlow 2.0 Architecture; the loop uses the train_dataset function defined previously:

In [0]:
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist


def train_dataset(batch_size=32, num_epochs=1):
    (train_x, train_y), (test_x, test_y) = fashion_mnist.load_data()
    input_x, input_y = train_x, train_y

    def scale_fn(image, label):
        return (
            tf.image.convert_image_dtype(image, tf.float32) - 0.5) * 2.0, label

    dataset = tf.data.Dataset.from_tensor_slices((tf.expand_dims(
        input_x, -1), tf.expand_dims(input_y, -1))).map(scale_fn)

    dataset = dataset.cache().repeat(num_epochs)
    dataset = dataset.shuffle(batch_size)

    return dataset.batch(batch_size).prefetch(1)


def make_model(n_classes):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(
            32, (5, 5), activation=tf.nn.relu, input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D((2, 2), (2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu),
        tf.keras.layers.MaxPool2D((2, 2), (2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(n_classes)
    ])


def train():
    # Define the model
    n_classes = 10
    model = make_model(n_classes)

    # Input data
    dataset = train_dataset(num_epochs=10)

    # Training parameters
    loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
    step = tf.Variable(1, name="global_step")
    optimizer = tf.optimizers.Adam(1e-3)
    accuracy = tf.metrics.Accuracy()

    # Train step function
    @tf.function
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss_value = loss(labels, logits)

        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        step.assign_add(1)

        accuracy_value = accuracy(labels, tf.argmax(logits, -1))
        return loss_value, accuracy_value

    @tf.function
    def loop():
        for features, labels in dataset:
            loss_value, accuracy_value = train_step(features, labels)
            if tf.equal(tf.math.floormod(step, 10), 0):
                tf.print(step, ": ", loss_value, " - accuracy: ",
                         accuracy_value)

    loop()

train()

10 :  1.4840436  - accuracy:  0.350694448
20 :  0.793415368  - accuracy:  0.508223712
30 :  0.767462611  - accuracy:  0.571120679
40 :  0.835911274  - accuracy:  0.607371807
50 :  0.308491945  - accuracy:  0.638392866
60 :  0.847064912  - accuracy:  0.661546588
70 :  0.513512  - accuracy:  0.677536249
80 :  0.64157629  - accuracy:  0.689477861
90 :  0.480841696  - accuracy:  0.695926964
100 :  0.244961098  - accuracy:  0.70612371
110 :  0.510952771  - accuracy:  0.711295843
120 :  0.373629093  - accuracy:  0.720063031
130 :  0.224795282  - accuracy:  0.727470934
140 :  0.895719826  - accuracy:  0.735611498
150 :  0.501918197  - accuracy:  0.744966447
160 :  0.755409122  - accuracy:  0.748820782
170 :  0.620160401  - accuracy:  0.748520732
180 :  0.562458873  - accuracy:  0.751396656
190 :  0.525883555  - accuracy:  0.755787
200 :  0.483723462  - accuracy:  0.758951
210 :  0.513407707  - accuracy:  0.762410283
220 :  0.576628566  - accuracy:  0.765268266
230 :  0.510008097  - accuracy: 

You are invited to read the source code carefully and compare it with the custom training loop from the previous chapter, Chapter 4, TensorFlow 2.0 Architecture.

# Image Classification Using TensorFlow Hub

We have discussed the image classification task in all of the previous chapters of this book. We have seen how it is possible to define a convolutional neural network by stacking several convolutional layers and how to train it using Keras. We also looked at eager execution and saw that using AutoGraph is straightforward.

이 책의 모든 이전 장에서 이미지 분류 작업에 대해 논의했습니다. 우리는 여러 개의 컨볼 루션 레이어를 쌓아 컨볼 루션 신경망을 정의하는 방법과 Keras를 사용하여이를 훈련시키는 방법을 보았습니다. 우리는 또한 간절한 실행을보고 AutoGraph를 사용하는 것이 간단하다는 것을 알았습니다.

So far, the convolutional architecture used has been a LeNet-like architecture, with an expected input size of 28 x 28, trained end to end every time to make the network learn how to extract the correct features to solve the fashion-MNIST classification task.

지금까지 사용 된 컨벌루션 아키텍처는 Lenet과 유사한 아키텍처로, 예상 입력 크기는 28 x 28이며, 네트워크가 패션 -MNIST 분류 작업을 해결하기 위해 올바른 기능을 추출하는 방법을 배우도록 매번 훈련되었습니다. .

Building a classifier from scratch, defining the architecture layer by layer, is an excellent didactical exercise that allows you to experiment with how different layer configurations can change the network performance. However, in real-life scenarios, the amount of data available to train a classifier is often limited. Gathering clean and correctly labeled data is a time-consuming process, and collecting a dataset with thousands of samples is tough. Moreover, even when the dataset size is adequate (thus, we are in a big data regime), training a classifier on it is a slow process; the training process might require several hours of GPU time since architectures more complicated than our LeNet-like architecture are necessary to achieve satisfactory results. Different architectures have been developed over the years, all of them introducing some novelties that have allowed the correct classification of color images with a resolution higher than 28 x 28.

계층별로 아키텍처를 정의하는 분류기를 처음부터 작성하는 것은 다른 계층 구성이 네트워크 성능을 어떻게 변경할 수 있는지 실험 해 볼 수있는 훌륭한 실습입니다. 그러나 실제 시나리오에서는 분류자를 훈련시키는 데 사용할 수있는 데이터의 양이 종종 제한됩니다. 깨끗하고 올바르게 레이블이 지정된 데이터를 수집하는 것은 시간이 많이 걸리는 프로세스이며 수천 개의 샘플이 포함 된 데이터 세트를 수집하는 것은 어렵습니다. 더욱이, 데이터 세트 크기가 충분하더라도 (따라서 우리는 큰 데이터 체제에있다), 분류기를 훈련시키는 것은 느린 과정이다. 만족스러운 결과를 얻으려면 LeNet과 유사한 아키텍처보다 복잡한 아키텍처가 필요하기 때문에 훈련 과정에는 몇 시간의 GPU 시간이 필요할 수 있습니다. 수년에 걸쳐 다양한 아키텍처가 개발되었으며, 모두 28 x 28보다 높은 해상도로 컬러 이미지를 올바르게 분류 할 수있는 몇 가지 참신함을 소개합니다.

Academia and industry release new classification architectures to improve the state of the art year on year. Their performance for an image classification task is measured by looking at the top-1 accuracy reached by the architecture when trained and tested on massive datasets such as ImageNet.

학계와 산업계는 매년 최신 상태를 개선하기 위해 새로운 분류 아키텍처를 출시합니다. 이미지 분류 작업의 성능은 ImageNet과 같은 대규모 데이터 세트에서 교육 및 테스트 할 때 아키텍처가 달성 한 최고의 정확도를 확인하여 측정됩니다.

ImageNet is a dataset of over 15 million high-resolution images with more than 22,000 categories, all of them manually labeled. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC ) is a yearly object detection and classification challenge that uses a subset of ImageNet of 1,000 images for 1,000 categories. The dataset used for the computation is made up of roughly 1.2 million training images, 50,000 validation images, and 100,000 testing images.

ImageNet은 22,000 개가 넘는 범주를 가진 1,500 만 개가 넘는 고해상도 이미지의 데이터 세트로, 모두 수동으로 레이블이 지정됩니다. ImageNet Largescale Visual Recognition Challenge (ILSVRC)는 1,000 개 범주에 대해 1,000 개 이미지의 ImageNet 하위 집합을 사용하는 연간 개체 감지 및 분류 문제입니다. 계산에 사용 된 데이터 세트는 대략 120 만 개의 훈련 이미지, 50,000 개의 검증 이미지 및 100,000 개의 테스트 이미지로 구성됩니다.

To achieve impressive results on an image classification task, researchers found that deep architectures were needed. This approach has a drawback—the deeper the network, the higher the number of parameters to train. But a higher number of parameters implies that a lot of computing power is needed (and computing power costs!). Since academia and industry have already developed and trained their models, why don't we take advantage of their work to speed up our development without reinventing the wheel every time?

이미지 분류 작업에서 인상적인 결과를 달성하기 위해 연구원들은 깊은 아키텍처가 필요하다는 것을 발견했습니다. 이 방법에는 네트워크가 깊을수록 학습 할 매개 변수 수가 더 많다는 단점이 있습니다. 그러나 더 많은 수의 매개 변수는 많은 컴퓨팅 성능이 필요하다는 것을 의미합니다 (및 컴퓨팅 성능 비용). 학계와 산업계에서 이미 모델을 개발하고 교육했기 때문에 매번 바퀴를 다시 만들지 않고 개발 속도를 높이기 위해 작업을 활용하지 않겠습니까?

In this chapter, we'll discuss transfer learning and fine-tuning, showing how they can speed up development. TensorFlow Hub is used as a tool to quickly get the models we need and speed up development.

이 장에서는 전이 학습 및 미세 조정에 대해 논의하고 이들이 개발 속도를 높이는 방법을 보여줍니다. TensorFlow Hub는 필요한 모델을 신속하게 얻고 개발 속도를 높이는 도구로 사용됩니다.

By the end of this chapter, you will know how to transfer the knowledge embedded in a model to a new task, using TensorFlow Hub easily, thanks to its Keras integration.

이 장을 마치면 Keras 통합 덕분에 TensorFlow Hub를 사용하여 모델에 포함 된 지식을 새로운 작업으로 이전하는 방법을 알 수 있습니다.

In this chapter, we will cover the following topics:

* Getting the data
* Transfer learning
* Fine-tuning

## Getting the data

The task we are going to solve in this chapter is a classification problem on a dataset of flowers, which is available in tensorflow-datasets (tfds). The dataset's name is tf_flowers and it consists of images of five different flower species at different resolutions. Using tfds, gathering the data is straightforward, and we can get the dataset's information by looking at the info variable returned by the tfds.load invocation, as shown here:

In [43]:
import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", with_info=True)
print(info)

tfds.core.DatasetInfo(
    name='tf_flowers',
    version=1.0.0,
    description='A large set of images of flowers',
    homepage='https://www.tensorflow.org/tutorials/load_data/images',
    features=FeaturesDict({
        'image': Image(shape=(None, None, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=5),
    }),
    total_num_examples=3670,
    splits={
        'train': 3670,
    },
    supervised_keys=('image', 'label'),
    citation="""@ONLINE {tfflowers,
    author = "The TensorFlow Team",
    title = "Flowers",
    month = "jan",
    year = "2019",
    url = "http://download.tensorflow.org/example_images/flower_photos.tgz" }""",
    redistribution_info=,
)



The preceding code produces the following dataset description:



In [44]:
dataset = dataset["train"]
tot = 3670

train_set_size = tot // 2
validation_set_size = tot - train_set_size - train_set_size // 2
test_set_size = tot - train_set_size - validation_set_size


print("train set size: ", train_set_size)
print("validation set size: ", validation_set_size)
print("test set size: ", test_set_size)

train, test, validation = (
    dataset.take(train_set_size),
    dataset.skip(train_set_size).take(validation_set_size),
    dataset.skip(train_set_size + validation_set_size).take(test_set_size),
)

train set size:  1835
validation set size:  918
test set size:  917


## Transfer learning

Only academia and some industries have the required budget and computing power to train an entire CNN from scratch, starting from random weights, on a massive dataset such as ImageNet.

Since this expensive and time-consuming work has already been done, it is a smart idea to reuse parts of the trained model to solve our classification problem.

In fact, it is possible to transfer what the network has learned from one dataset to a new one, thereby transferring the knowledge.

Transfer learning is the process of learning a new task by relying on a previously learned task: the learning process can be faster, more accurate, and require less training data.

The transfer learning idea is bright, and it can be successfully applied when using convolutional neural networks.

In fact, all convolutional architectures for classification have a fixed structure, and we can reuse parts of them as building blocks for our applications. The general structure is composed of three elements: 

* Input layer: The architecture is designed to accept an image with a precise resolution. The input resolution influences all of the architecture; if the input layer resolution is high, the network will be deeper.
* Feature extractor: This is the set of convolution, pooling, normalizations, and every other layer that is in between the input layer and the first dense layer. The architecture learns to summarize all the information contained in the input image in a low-dimensional representation (in the diagram that follows, an image with a size of 227 x 227 x 3 is projected into a 9216-dimensional vector).
* Classification layers: These are a stack of fully connected layers—a fully-connected classifier built on top of the low-dimensional representation of the input extracted by the classifier:

## TensorFlow Hub

The description of TensorFlow Hub that can be found on the official documentation describes what TensorFlow Hub is and what it's about pretty well:

TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models. A module is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks in a process known as transfer learning. Transfer learning can:

- Train a model with a smaller dataset
- Improve generalization, and
- Speed up training
Thus, TensorFlow Hub is a library we can browse while a looking for a pre-trained model that best fits our needs. TensorFlow Hub comes both as a website we can browse (https://tfhub.dev) and as a Python package.

Installing the Python package allows us to have perfect integration with the modules loaded on TensorFlow Hub and TensorFlow 2.0:


In [0]:
! pip install tensorflow-hub>0.3

That is all we need to do to get access to a complete library of pre-trained models compatible and integrated with TensorFlow.

The TensorFlow 2.0 integration is terrific—we only need the URL of the module on TensorFlow Hub to create a Keras layer that contains the parts of the model we need!

Browsing the catalog on https://tfhub.dev is intuitive. The screenshot that follows shows how to use the search engine to find any module that contains the string tf2 (this is a fast way to find an uploaded module that is TensorFlow 2.0 compatible and ready to use):

The TensorFlow Hub website (https://tfhub.dev): it is possible to search for modules by query string (in this case, tf2) and refine the results by using the filter column on the left.
There are models in both versions: feature vector-only and classification, which means a feature vector plus the trained classification head. The TensorFlow Hub catalog already contains everything we need for transfer learning. In the next section, we will see how easy it is to integrate the Inception v3 module from TensorFlow Hub into TensorFlow 2.0 source code, thanks to the Keras API.

## Using Inception v3 as a feature extractor

The complete analysis of the Inception v3 architecture is beyond the scope of this book; however, it is worth noting some peculiarities of this architecture so as to use it correctly for transfer learning on a different dataset.

Inception v3 is a deep architecture with 42 layers, which won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2015. Its architecture is shown in the following screenshot:

![](https://learning.oreilly.com/library/view/hands-on-neural-networks/9781789615555/assets/0ef8f6ae-c05d-4b57-9342-033cc828716f.png)

Inception v3 architecture. The model architecture is complicated and very deep. The network accepts a 299 x 299 x 3 image as input and produces an 8 x 8 x 2,048 feature map, which is the input of the final part; that is, a classifier trained on 1,000 +1 classes of ImageNet. Image source: https://cloud.google.com/tpu/docs/inception-v3-advanced.
The network expects an input image with a resolution of 299 x 299 x 3 and produces an 8 x 8 x 2,048 feature map. It has been trained on 1,000 classes of the ImageNet dataset, and the input images have been scaled in the [0,1] range.

All this information is available on the module page, reachable by clicking on the search result on the TensorFlow Hub website. Unlike the official architecture shown previously, on this page, we can find information about the extracted feature vector. The documentation says that it is a 2,048-feature vector, which means that the feature vector used is not the flattened feature map (that would have been an 8 * 8 * 2048 dimensional vector) but one of the fully-connected layers placed at the end of the network.

It is essential to know the expected input shape and the feature vector size to feed the network with correctly resized images and to attach the final layers, knowing how many connections there would be between the feature vector and the first fully-connected layer.

More importantly, it is necessary to know on which dataset the network was trained since transfer learning works well if the original dataset shares some features with the target (new) dataset. The following screenshot shows some samples gathered from the dataset used for the ILSVRC in 2015:

### Adapting data to the model

The information found on the module page also tells us that it is necessary to add a pre-processing step to the dataset split built earlier: the tf_flower images are tf.uint8, which means they are in the [0,255] range, while Inception v3 has been trained on images in the [0,1] range, which are thus tf.float32:

In [0]:
def to_float_image(example):
    example["image"] = tf.image.convert_image_dtype(example["image"], tf.float32)
    return example

Moreover, the Inception architecture requires a fixed input shape of 299 x 299 x 3. Therefore, we have to ensure that all our images are correctly resized to the expected input size:

In [0]:
def resize(example):
    example["image"] = tf.image.resize(example["image"], (299, 299))
    return example

All the required pre-processing operations have been defined, so we are ready to apply them to the train, validation, and test splits:


In [0]:
train = train.map(to_float_image).map(resize)
validation = validation.map(to_float_image).map(resize)
test = test.map(to_float_image).map(resize)

To summarize: the target dataset is ready; we know which model we want to use as a feature extractor;  the module information page told us that some preprocessing steps were required to make the data compatible with the model.

Everything is set up to design the classification model that uses Inception v3 as the feature extractor. In the next section, the extreme ease of use of the tensorflow-hub module is shown, thanks to its Keras integration.

## Building the model – hub.KerasLayer

The TensorFlow Hub Python package has already been installed, and this is all we need to do:

* Download the model parameters and graph description
* Restore the parameters in its graph
* Create a Keras layer that wraps the graph and allows us to use it like any other Keras layer we are used to using
These three points are executed under the hook of the KerasLayer tensorflow-hub function:


In [9]:
import tensorflow_hub as hub

hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2",
    output_shape=[2048],
    trainable=False)

<tensorflow_hub.keras_layer.KerasLayer at 0x7f8c25130da0>

The hub.KerasLayer function creates hub.keras_layer.KerasLayer, which is a tf.keras.layers.Layer object. Therefore, it can be used in the same way as any other Keras layer—this is powerful!

This strict integration allows us to define a model that uses the Inception v3 as a feature extractor and it has two fully connected layers as classification layers in very few lines:

In [0]:
num_classes = 5

model = tf.keras.Sequential(
    [
        hub.KerasLayer(
            "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2",
            output_shape=[2048],
            trainable=False,
        ),
        tf.keras.layers.Dense(512),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(num_classes), # linear
    ]
)

The model definition is straightforward, thanks to the Keras integration. Everything is set up to define the training loop, measure the performance, and see whether the transfer learning approach gives us the expected classification results.

Unfortunately, the process of downloading a pre-trained model from TensorFlow Hub is fast only on high-speed internet connections. A progress bar that shows the download progress is not enabled by default and, therefore, a lot of time could be required (depending on the internet speed) to build the model for the first time.

To enable a progress bar, using the TFHUB_DOWNLOAD_PROGRESS environment variable is required by hub.KerasLayer. Therefore, on top of the script, the following snippet can be added, which defines this environment variable and puts the value of 1 inside it; in this way, a handy progress bar will be shown on the first download:

In [0]:
import os
os.environ["TFHUB_DOWNLOAD_PROGRESS"] = "1"

## Training and evaluating

Using a pre-trained feature extractor allows us to speed up the training while keeping the training loop, the losses, and optimizers unchanged, using the same structure of every standard classifier train.

Since the dataset labels are tf.int64 scalars, the loss that is going to be used is the standard sparse categorical cross-entropy, setting the from_logits parameter to True. As seen in the previous chapter, Chapter 5, Efficient Data Input Pipelines and Estimator API, setting this parameter to True is a good practice since, in this way, it's the loss function itself that applies the softmax activation function, being sure to compute it in a numerically stable way, and thereby preventing the loss becoming NaN:

In [12]:
# Training utilities
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
step = tf.Variable(1, name="global_step", trainable=False)
optimizer = tf.optimizers.Adam(1e-3)

train_summary_writer = tf.summary.create_file_writer("./log/transfer/train")
validation_summary_writer = tf.summary.create_file_writer("./log/transfer/validation")

# Metrics
accuracy = tf.metrics.Accuracy()
mean_loss = tf.metrics.Mean(name="loss")

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        logits = model(inputs)
        loss_value = loss(labels, logits)

    gradients = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    step.assign_add(1)

    accuracy.update_state(labels, tf.argmax(logits, -1))
    return loss_value

# Configure the training set to use batches and prefetch
train = train.batch(32).prefetch(1)
validation = validation.batch(32).prefetch(1)
test = test.batch(32).prefetch(1)

num_epochs = 10
for epoch in range(num_epochs):

    for example in train:
        image, label = example["image"], example["label"]
        loss_value = train_step(image, label)
        mean_loss.update_state(loss_value)

        if tf.equal(tf.math.mod(step, 10), 0):
            tf.print(
                step, " loss: ", mean_loss.result(), " acccuracy: ", accuracy.result()
            )
            mean_loss.reset_states()
            accuracy.reset_states()

    # Epoch ended, measure performance on validation set
    tf.print("## VALIDATION - ", epoch)
    accuracy.reset_states()
    for example in validation:
        image, label = example["image"], example["label"]
        logits = model(image)
        accuracy.update_state(label, tf.argmax(logits, -1))
    tf.print("accuracy: ", accuracy.result())
    accuracy.reset_states()

10  loss:  1.4032892  acccuracy:  0.479166657
20  loss:  0.699153781  acccuracy:  0.715625
30  loss:  0.460169077  acccuracy:  0.81875
40  loss:  0.442007244  acccuracy:  0.81875
50  loss:  0.490934193  acccuracy:  0.815625
## VALIDATION -  0
accuracy:  0.813522339
60  loss:  0.43458119  acccuracy:  0.96875
70  loss:  0.495889515  acccuracy:  0.8375
80  loss:  0.314885676  acccuracy:  0.88125
90  loss:  0.272703886  acccuracy:  0.9
100  loss:  0.211857349  acccuracy:  0.925
110  loss:  0.283404648  acccuracy:  0.8875
## VALIDATION -  1
accuracy:  0.866957486
120  loss:  0.285670698  acccuracy:  0.90625
130  loss:  0.258526415  acccuracy:  0.903125
140  loss:  0.180402458  acccuracy:  0.934375
150  loss:  0.174977928  acccuracy:  0.9375
160  loss:  0.154107615  acccuracy:  0.946875
170  loss:  0.184987038  acccuracy:  0.9375
## VALIDATION -  2
accuracy:  0.877862573
180  loss:  0.199962854  acccuracy:  0.91875
190  loss:  0.135458723  acccuracy:  0.953125
200  loss:  0.108723208  acccur

After a single training epoch, we got a validation accuracy of 0.87, while the training accuracy was even lower (0.83). But by the end of the tenth epoch, the validation accuracy had even decreased (0.86), while the model was overfitting the training data.

In the Exercises section, you will find several exercises that use the previous code as a starting point; the overfitting problem should be tackled from several points of view, finding the best way to deal with it.

Before starting the next main section, it's worth adding a simple performance measurement that measures how much time is needed to compute a single training epoch.

## Training speed

Faster prototyping and training is one of the strengths of the transfer learning approach. One of the reasons behind the fact that transfer learning is often used in industry is the financial savings that it produces, reducing both the development and training time.

To measure the training time, the Python time package can be used. time.time() returns the current timestamp, allowing you to measure (in milliseconds) how much time is needed to perform a training epoch.

The training loop of the previous section can thus be extended by adding the time module import and the duration measurement:

In [13]:
from time import time

# [...]
for epoch in range(num_epochs):
    start = time()
    for example in train:
        image, label = example["image"], example["label"]
        loss_value = train_step(image, label)
        mean_loss.update_state(loss_value)

        if tf.equal(tf.math.mod(step, 10), 0):
            tf.print(
                step, " loss: ", mean_loss.result(), " acccuracy: ", accuracy.result()
            )
            mean_loss.reset_states()
            accuracy.reset_states()
    end = time()
    print("Time per epoch: ", end-start)
# remeaning code

590  loss:  0.16144824  acccuracy:  0.954861104
600  loss:  0.0597647242  acccuracy:  0.9875
610  loss:  0.0527747273  acccuracy:  0.984375
620  loss:  0.0505279601  acccuracy:  0.984375
630  loss:  0.101979196  acccuracy:  0.9625
Time per epoch:  5.251030921936035
640  loss:  0.0909952298  acccuracy:  0.969899654
650  loss:  0.0716021508  acccuracy:  0.984375
660  loss:  0.080451481  acccuracy:  0.96875
670  loss:  0.0813035071  acccuracy:  0.96875
680  loss:  0.0819518417  acccuracy:  0.965625
690  loss:  0.15046224  acccuracy:  0.934375
Time per epoch:  5.181036949157715
700  loss:  0.128995061  acccuracy:  0.946488321
710  loss:  0.180704758  acccuracy:  0.928125
720  loss:  0.0911093801  acccuracy:  0.95625
730  loss:  0.0698170885  acccuracy:  0.975
740  loss:  0.0626426861  acccuracy:  0.975
750  loss:  0.12250489  acccuracy:  0.95625
Time per epoch:  5.160876035690308
760  loss:  0.109201357  acccuracy:  0.956521749
770  loss:  0.29387337  acccuracy:  0.896875
780  loss:  0.116

On average, running the training loop on a Colab notebook (https://colab.research.google.com) equipped with an Nvidia k40 GPU, we obtain an execution speed as follows:

```
Time per epoch: 16.206
```

As shown in the next section, transfer learning using a pre-trained model as a feature extractor gives a considerable speed boost.

Sometimes, using a pre-trained model as a feature extractor only is not the best way to transfer knowledge from one domain to another, often because the domains are too different and the features learned are useless for solving the new task.

In these cases, it is possible—and recommended—to not have a fixed-feature extractor part but let the optimization algorithm change it, training the whole model end to end.

## Fine-tuning

Fine-tuning is a different approach to transfer learning. Both share the same goal of transferring the knowledge learned on a dataset on a specific task to a different dataset and a different task. Transfer learning, as shown in the previous section, reuses the pre-trained model without making any changes to its feature extraction part; in fact, it is considered a non-trainable part of the network.

Fine-tuning, instead, consists of fine-tuning the pre-trained network weights by continuing backpropagation.

## When to fine-tune

Fine-tuning a network requires having the correct hardware; backpropagating the gradients through a deeper network requires you to load more information in memory. Very deep networks have been trained from scratch in data centers with thousands of GPUs. Therefore, prepare to lower your batch size to as low as 1, depending on how much available memory you have.

Hardware requirements aside, there are other different points to keep in mind when thinking about fine-tuning:

* Dataset size: Fine-tuning a network means using a network with a lot of trainable parameters, and, as we know from the previous chapters, a network with a lot of parameters is prone to overfitting.
If the target dataset size is small, it is not a good idea to fine-tune the network. Using the network as a fixed-feature extractor will probably bring in better results.
* Dataset similarity: If the dataset size is large (where large means with a size comparable to the one the pre-trained model has been trained on) and it is similar to the original one, fine-tuning the model is probably a good idea. Slightly adjusting the network parameters will help the network to specialize in the extraction of features that are specific to this dataset, while correctly reusing the knowledge from the previous, similar dataset.
If the dataset size is large and it is very different from the original, fine-tuning the network could help. In fact, the initial solution of the optimization problem is likely to be close to a good minimum when starting with a pre-trained model, even if the dataset has different features to learn (this is because the lower layers of the CNN usually learn low-level features that are common to every classification task).

If the new dataset satisfies the similarity and size constraints, fine-tuning the model is a good idea. **One important parameter to look at closely is the learning rate.** When fine-tuning a pre-trained model, we suppose the model parameters are good (and they are since they are the parameters of the model that achieved state-of-the-art results on a difficult challenge), and, for this reason, a small learning rate is suggested.

Using a high learning rate would change the network parameters too much, and we don't want to change them in this way. Instead, using a small learning rate, we slightly adjust the parameters to make them adapt to the new dataset, without distorting them too much, thus reusing the knowledge without destroying it.

Of course, if the fine-tuning approach is chosen, the hardware requirements have to be kept in mind: lowering the batch size is perhaps the only way to fine-tune very deep models when using a standard GPU to do the work.

## TensorFlow Hub integration

Fine-tuning a model downloaded from TensorFlow Hub might sound difficult; we have to do the following:

1. Download the model parameters and graph
1. Restore the model parameters in the graph
1. Restore all the operations that are executed only during the training (activating dropout layers and enabling the moving mean and variance computed by the batch normalization layers)
1. Attach the new layers on top of the feature vector
1. Train the model end to end

In practice, the integration of TensorFlow Hub and Keras models is so tight that we can achieve all this by setting the trainable Boolean flag to True when importing the model using hub.KerasLayer:

In [30]:
hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2",
    output_shape=[2048],
    trainable=True) # <- That's all!

<tensorflow_hub.keras_layer.KerasLayer at 0x7f8b90916128>

## Train and evaluate

What happens if we build the same model as in the previous chapter, Chapter 5, Efficient Data Input Pipelines and Estimator API, and we train it on the tf_flower dataset, fine-tuning the weights?

The model is thus the one that follows; please note how the learning rate of the optimizer has been reduced from 1e-3 to 1e-5:

In [48]:
train

<MapDataset shapes: {image: (299, 299, 3), label: ()}, types: {image: tf.float32, label: tf.int64}>

In [0]:
model = tf.keras.Sequential(
    [
        hub.KerasLayer(
            "https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/2",
            output_shape=[2048],
            trainable=True, # <- enables fine tuning
        ),
        tf.keras.layers.Dense(512),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dense(num_classes), # linear
    ]
)

loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
step = tf.Variable(1, name="global_step", trainable=False)
optimizer = tf.optimizers.Adam(1e-5)

train_summary_writer = tf.summary.create_file_writer("./log/transfer/train")
validation_summary_writer = tf.summary.create_file_writer("./log/transfer/validation")

# Metrics
accuracy = tf.metrics.Accuracy()
mean_loss = tf.metrics.Mean(name="loss")

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        logits = model(inputs)
        loss_value = loss(labels, logits)

    gradients = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    step.assign_add(1)

    accuracy.update_state(labels, tf.argmax(logits, -1))
    return loss_value

# Configure the training set to use batches and prefetch
train = train.batch(32).prefetch(1)
validation = validation.batch(32).prefetch(1)
test = test.batch(32).prefetch(1)

num_epochs = 10
for epoch in range(num_epochs):

    for example in train:
        image, label = example["image"], example["label"]
        loss_value = train_step(image, label)
        mean_loss.update_state(loss_value)

        if tf.equal(tf.math.mod(step, 10), 0):
            tf.print(
                step, " loss: ", mean_loss.result(), " acccuracy: ", accuracy.result()
            )
            mean_loss.reset_states()
            accuracy.reset_states()

    # Epoch ended, measure performance on validation set
    tf.print("## VALIDATION - ", epoch)
    accuracy.reset_states()
    for example in validation:
        image, label = example["image"], example["label"]
        logits = model(image)
        accuracy.update_state(label, tf.argmax(logits, -1))
    tf.print("accuracy: ", accuracy.result())
    accuracy.reset_states()

10  loss:  1.46650445  acccuracy:  0.40625
20  loss:  1.12444711  acccuracy:  0.615625
30  loss:  0.757353902  acccuracy:  0.790625
40  loss:  0.498438358  acccuracy:  0.85625
50  loss:  0.43775019  acccuracy:  0.8375
## VALIDATION -  0
accuracy:  0.869138479
60  loss:  0.344315886  acccuracy:  0.96875
70  loss:  0.337916344  acccuracy:  0.890625
80  loss:  0.198741242  acccuracy:  0.93125
90  loss:  0.192094058  acccuracy:  0.925
100  loss:  0.11699377  acccuracy:  0.971875
110  loss:  0.115660056  acccuracy:  0.978125
## VALIDATION -  1
accuracy:  0.884405673
120  loss:  0.0963624641  acccuracy:  0.96875
130  loss:  0.0865608603  acccuracy:  0.98125
140  loss:  0.0634469837  acccuracy:  0.9875
150  loss:  0.0516169295  acccuracy:  0.990625
160  loss:  0.0408696234  acccuracy:  0.996875
170  loss:  0.0368052311  acccuracy:  0.99375
## VALIDATION -  2
accuracy:  0.897491813
180  loss:  0.0522813089  acccuracy:  0.99375
190  loss:  0.0380167142  acccuracy:  0.990625
200  loss:  0.027393

In the following box, the first and last training epochs' output is shown:

```
10 loss: 1.59038031 acccuracy: 0.288194448
20 loss: 1.25725865 acccuracy: 0.55625
30 loss: 0.932323813 acccuracy: 0.721875
40 loss: 0.63251847 acccuracy: 0.81875
50 loss: 0.498087496 acccuracy: 0.84375
## VALIDATION - 0
accuracy: 0.872410059

[...]

530 loss: 0.000400377758 acccuracy: 1
540 loss: 0.000466914673 acccuracy: 1
550 loss: 0.000909397728 acccuracy: 1
560 loss: 0.000376881275 acccuracy: 1
570 loss: 0.000533850689 acccuracy: 1
580 loss: 0.000438459858 acccuracy: 1
## VALIDATION - 9
accuracy: 0.925845146
```

As expected, the test accuracy reached the constant value of 1; hence we overfitted the training set. This was something expected since the tf_flower dataset is smaller and simpler than ImageNet. However, to see the overfitting problem clearly, we had to wait longer since having more parameters to train makes the whole learning process extremely slow, especially compared to the previous train when the pre-trained model was not trainable.


## Training speed

By adding the time measurements as we did in the previous section, it is possible to see how the fine-tuning process is extremely slow compared to transfer learning, using the model as a non-trainable feature extractor.

In fact, if, in the previous scenario, we reached an average training speed per epoch of about 16.2 seconds, now we have to wait, on average, 60.04 seconds, which is a 370% slowdown!

Moreover, it is interesting to see that at the end of the first epoch, we reached the same validation accuracy as was achieved in the previous training and that, despite overfitting the training data, the validation accuracy obtained at the end of the tenth epoch is greater than the previous one.

This simple experiment showed how using a pre-trained model as a feature extractor could lead to worse performance than fine-tuning it. This means that the features the network learned to extract on the ImageNet dataset are too different from the features that would be needed to classify the flowers, dataset correctly.

Choosing whether to use a pre-trained model as a fixed-feature extractor or to fine-tune it is a tough decision, involving a lot of trade-offs. Understanding whether the pre-trained model extracts features that are correct for the new task is complicated; merely looking at dataset size and similarity is a guideline, but in practice, this decision requires several tests.

Of course, it is better to use the pre-trained model as a feature extractor first, and, if the new model's performance is already satisfactory, there is no need to waste time trying to fine-tune it. If the results are not satisfying, it is worth trying a different pre-trained model and, as a last resort, trying the fine-tuning approach (because this requires more computational power, and it is expansive).