---------------------------------------
# Introduction

**Neural Style Transfer**

> Neural Style Transfer (NST) refers to a class of software algorithms that manipulate digital images, or videos, in order to adopt the appearance or visual style of another image.
>
> ![](https://upload.wikimedia.org/wikipedia/commons/a/a2/Mona_lisa_the_starry_night_o_lbfgs_i_content_h_720_m_vgg19_cw_100000.0_sw_30000.0_tv_1.0.jpg)
>
> Mona Lisa in the style of "The Starry Night" using neural style transfer.
>
> **NST**
>
> NST was first published in the paper "A Neural Algorithm of Artistic Style" by Leon Gatys et al.
>
> NST is based on histogram-based texture synthesis algorithms, notably the method of Portilla and Simoncelli. NST can be summarized as histogram-based texture synthesis with convolutional neural network (CNN) features for the image analogies problem. The original paper used a VGG-19 architecture that has been pre-trained to perform object recognition using the ImageNet dataset.

Ref: https://en.wikipedia.org/wiki/Neural_Style_Transfer

## A brief overview of the approaches and resources

There are two main approaches to NST:
1. The first one is based on the original [paper by Gatys et al.](https://arxiv.org/abs/1508.06576) and its subsequent improvements and developments. The main idea here is to take the content image and change it so that its low-level features resemble those of the style image.

    Roughly, the development of these methods can be summarized like this:
    * 2015 original Gatys' approach optimizing an image directly ([paper](https://arxiv.org/abs/1508.06576), [keras tutorial](https://keras.io/examples/generative/neural_style_transfer/), [pytorch notebook](https://www.kaggle.com/code/ohseokkim/transfering-style)) ->
    * 2016 Johnson et al. ([paper](https://link.springer.com/chapter/10.1007/978-3-319-46475-6_43)) and Ulyanov et al. ([paper](https://arxiv.org/abs/1603.03417)) trained a feed-forward network that transforms the input image into a stylized one in a single pass (much faster, but each network handles only the style it was trained on). This could be a good fit for this particular competition, since we are asked to transfer only one style. ->
    * 2017 ConditionalIN ([paper](https://arxiv.org/abs/1610.07629)) allowed transferring different styles with one network by learning $\mu$ and $\sigma$ ->
    * 2017 AdaIN ([paper](https://arxiv.org/pdf/1703.06868.pdf), [tutorial](https://keras.io/examples/generative/adain/)) allowed transferring any style by simply using $\mu$ and $\sigma$ of the style image. However, it still needs a decoder trained on style images, so performance on unseen styles could be worse ->
    * 2017 Universal Style Transfer via Feature Transforms ([blog](https://towardsdatascience.com/universal-style-transfer-b26ba6760040), [paper](https://arxiv.org/abs/1705.08086)) allows transferring any style while the whole process of training does not use any style images.
    
    [A nice video describing this.](https://www.youtube.com/watch?v=8pp0Oa3t52s&t=536s&ab_channel=TheAIEpiphany)

2. The second approach uses Generative Adversarial Networks (GANs) to create an image that is similar to the images of the chosen style but preserves the content of the original image. Probably the most influential work here is CycleGAN and its developments ([paper](https://arxiv.org/abs/1703.10593) and the authors' awesome [project page](https://junyanz.github.io/CycleGAN/) with lots of resources, [keras](https://keras.io/examples/generative/cyclegan/) and [tf](https://www.tensorflow.org/tutorials/generative/cyclegan) tutorials).

[Collection of Keras tutorials, some of which are devoted to NST.](https://keras.io/examples/generative/)

In this work, we will explore the original **Gatys'** approach, the **AdaIN** method, and **CycleGANs**.


# 0. Setup and some image exploration

This part is heavily copy-pasted from [this notebook](https://www.kaggle.com/code/ihelon/monet-visualization-and-augmentation); all credit should go to the author.

In [None]:
import os
import math
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2

In [None]:
def set_seed(seed):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)

SEED = 42
set_seed(SEED)

In [None]:
BASE_PATH = '/kaggle/input'
#BASE_PATH = '.'
for dirname, _, filenames in os.walk(BASE_PATH):
    print(dirname)

In [None]:
MONET_PATH_JPG = os.path.join(BASE_PATH, "gan-getting-started", "monet_jpg")
PHOTO_PATH_JPG = os.path.join(BASE_PATH, "gan-getting-started","photo_jpg")
STYLES_FOR_NST = os.path.join(BASE_PATH, "neural-style-transfer", "Style Images")
ARTS_FOR_NST = os.path.join(BASE_PATH, "images-for-style-transfer", "Data", "Artworks")
BEST_ARTS_FOR_NST = os.path.join(BASE_PATH, "best-artworks-of-all-time", "resized", "resized")

In [None]:
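def print_folder_statistics(path):
    # Count how many images share each (height, width, channels) shape.
    d_image_sizes = {}
    max_to_print = 20
    try:
        image_names = os.listdir(path)
    except OSError:
        # The dataset may not be attached to the notebook.
        print(f"cannot list folder: {path}")
        return
    for image_name in image_names:
        image = cv2.imread(os.path.join(path, image_name))
        if image is None:  # cv2.imread returns None for unreadable entries
            continue
        d_image_sizes[image.shape] = d_image_sizes.get(image.shape, 0) + 1
        if len(d_image_sizes) >= max_to_print:
            break
    for size, count in d_image_sizes.items():
        print(f"shape: {size}\tcount: {count}")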

print(f"Monet images:")
print_folder_statistics(MONET_PATH_JPG)
print("-" * 10)
print(f"Photo images:")
print_folder_statistics(PHOTO_PATH_JPG)
print("-" * 10)
print(f"NST style images:")
print_folder_statistics(STYLES_FOR_NST)
print("-" * 10)
print(f"Artwork images:")
print_folder_statistics(ARTS_FOR_NST)
print("-" * 10)
print(f"Best Artworks:")
print_folder_statistics(BEST_ARTS_FOR_NST)
print("-" * 10)

In [None]:
# let's save required size
IMAGE_SIZE = 256

In [None]:
def batch_visualization(path, n_images, is_random=True, figsize=(16, 16)):
    plt.figure(figsize=figsize)    
    w = int(n_images ** .5)
    h = math.ceil(n_images / w)
    all_names = os.listdir(path)
    image_names = all_names[:n_images]
    if is_random:
        image_names = random.sample(all_names, n_images)
    for ind, image_name in enumerate(image_names):
        img = cv2.imread(os.path.join(path, image_name))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) 
        plt.subplot(h, w, ind + 1)
        plt.imshow(img)
        plt.axis("off")
    plt.show()

Some Monet images:

In [None]:
batch_visualization(MONET_PATH_JPG, 9, is_random=True)

In [None]:
batch_visualization(PHOTO_PATH_JPG, 16, is_random=True)

In [None]:
batch_visualization(STYLES_FOR_NST, 2, is_random=True)

In [None]:
batch_visualization(ARTS_FOR_NST, 16, is_random=True)

In [None]:
batch_visualization(BEST_ARTS_FOR_NST, 16, is_random=True)

------------------------------------------------
# 1. Original Gatys' approach
Based on the Keras tutorial: [Neural style transfer](https://keras.io/examples/generative/neural_style_transfer/)

Style transfer consists in generating an image ($NST$)
with the same "content" ($C$) as a base image, but with the
"style" ($S$) of a different picture (typically artistic).
This is achieved through the optimization of a loss function ($L_t$)
that has 3 components: "style loss" ($L_s$), "content loss" ($L_c$),
and "total variation loss" ($L_v$):

$L_t = w_s \cdot L_s + w_c \cdot L_c + w_v \cdot L_v$

- The style loss is where deep learning comes in: it is defined
using a deep convolutional neural network. Precisely, it consists of a sum of
$L_2$ distances between the [Gram matrices](https://en.wikipedia.org/wiki/Gram_matrix) of the representations of
the generated image and the style reference image, extracted from
different layers of a convnet (trained on `ImageNet`). The general idea
is to capture color/texture information at different spatial
scales:

    $L_s = \sum_{i=1}^{l}{||GramMatrix(\phi_i(NST)) - GramMatrix(\phi_i(S))||_2}$

    where $\phi_i$ denotes the layers of convnet used to compute the loss (VGG-19 in this case).

    > If the vectors are centered random variables, the Gram matrix is approximately proportional to the covariance matrix, with the scaling determined by the number of elements in the vector.

    https://en.wikipedia.org/wiki/Gram_matrix

- The content loss is an $L_2$ distance between the features of the base
image (extracted from a deep layer $d$) and the features of the combination image,
keeping the generated image close enough to the original one:

    $L_c = ||\phi_d(NST) - \phi_d(C)||_2$

- The total variation loss imposes local spatial continuity between
the pixels of the combination image, giving it visual coherence.
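
The quoted remark about Gram matrices and covariance can be checked numerically. The following is a minimal NumPy sketch on random data; the array sizes and variable names are illustrative assumptions and not part of the notebook pipeline. For centered feature rows, the Gram matrix divided by the number of spatial positions should equal the (biased) covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_positions = 4, 1000  # illustrative sizes (assumption)

# Rows play the role of flattened feature maps, one row per channel.
features = rng.normal(size=(n_channels, n_positions))
features -= features.mean(axis=1, keepdims=True)  # center each channel

gram = features @ features.T        # shape (n_channels, n_channels)
cov = np.cov(features, bias=True)   # bias=True normalizes by n_positions

gram_matches_cov = np.allclose(gram / n_positions, cov)
print(gram_matches_cov)
```

Without the centering step the two matrices generally differ, which is why the Gram matrix also carries information about the mean activation of each channel.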

![](https://miro.medium.com/max/1400/1*VAQs1KSfbysnloPah_fHGQ.gif)

Picture Credit: https://miro.medium.com

**Reference:** [A Neural Algorithm of Artistic Style](
  http://arxiv.org/abs/1508.06576)

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import vgg19
!mkdir -p ../gatys_generated

## Image preprocessing / deprocessing utilities

In [None]:
def preprocess_image(image_path):
    # Util function to open, resize and format pictures into appropriate tensors
    img = keras.preprocessing.image.load_img(
        image_path, target_size=(IMAGE_SIZE, IMAGE_SIZE)
    )
    img = keras.preprocessing.image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    # vgg19.preprocess_input will convert the input images from RGB to BGR, then will zero-center each color channel with respect to the ImageNet dataset, without scaling
    img = vgg19.preprocess_input(img)
    return tf.convert_to_tensor(img)


def deprocess_image(x):
    x = x.numpy()
    # Util function to convert a tensor into a valid image
    x = x.reshape((IMAGE_SIZE, IMAGE_SIZE, 3))
    # Remove zero-center by mean pixel
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype("uint8")
    return x
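
To make the two utilities above more concrete, here is a NumPy-only sketch of the same round trip. `preprocess_np` and `deprocess_np` are hypothetical stand-ins written for illustration, assuming the Caffe-style convention (RGB to BGR, subtract per-channel ImageNet means, no scaling) described in the comment above. Rounding is added before the integer cast so that floating-point error cannot shift a pixel by one, which is a small departure from `deprocess_image`.

```python
import numpy as np

# ImageNet channel means in BGR order, the same constants used in deprocess_image.
BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess_np(rgb):
    """Hypothetical stand-in: RGB -> BGR, then subtract the channel means."""
    bgr = rgb[..., ::-1].astype(np.float32)
    return bgr - BGR_MEANS

def deprocess_np(x):
    """Inverse step: add the means back, BGR -> RGB, round and clip to [0, 255]."""
    rgb = (x + BGR_MEANS)[..., ::-1]
    return np.clip(np.rint(rgb), 0, 255).astype("uint8")

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)  # fake RGB image
restored = deprocess_np(preprocess_np(image))
round_trip_ok = np.array_equal(image, restored)
print(round_trip_ok)
```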

## Compute the loss functions

The authors propose to use a pretrained VGG-19 to compute the loss
function of the network.
The total loss is a weighted combination of:

- The `style_loss` function, which keeps the generated image close to the local textures
of the style reference image
- The `content_loss` function, which keeps the high-level representation of the
generated image close to that of the base image
- The `total_variation_loss` function, a regularization loss which keeps the generated image locally-coherent

So, we define 4 utility functions: `gram_matrix`, `style_loss`, `content_loss`, and `total_variation_loss`.

In [None]:
# The gram matrix of an image tensor (feature-wise outer product)
def gram_matrix(x):
    # x.shape = (height, width, n_channels); the spatial size depends on the layer
    x = tf.transpose(x, (2, 0, 1))
    # x.shape = (n_channels, height, width)
    features = tf.reshape(x, (tf.shape(x)[0], -1))
    # features.shape = (n_channels, height * width)
    gram = tf.matmul(features, tf.transpose(features))
    # gram.shape = (n_channels, n_channels)
    return gram


# The "style loss" is designed to maintain
# the style of the reference image in the generated image.
# It is based on the gram matrices (which capture style) of
# feature maps from the style reference image
# and from the generated image
def style_loss(style, generated):
    S = gram_matrix(style)
    C = gram_matrix(generated)
    channels = 3
    size = IMAGE_SIZE * IMAGE_SIZE * channels
    return tf.reduce_mean(tf.square(S - C)) /  (size ** 2)


# An auxiliary loss function
# designed to maintain the "content" of the
# base image in the generated image
def content_loss(base, generated):
    return tf.reduce_mean(tf.square(generated - base))


# The 3rd loss function, total variation loss,
# designed to keep the generated image locally coherent
def total_variation_loss(generated):
    img_nrows = generated.shape[1]
    img_ncols = generated.shape[2]
    horizontal_shift_diff = tf.square(
        generated[:, : img_nrows - 1, : img_ncols - 1, :] - generated[:, 1:, : img_ncols - 1, :]
    )
    vertical_shift_diff = tf.square(
        generated[:, : img_nrows - 1, : img_ncols - 1, :] - generated[:, : img_nrows - 1, 1:, :]
    )
    return tf.reduce_mean(tf.pow(horizontal_shift_diff + vertical_shift_diff, 1.25))
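
As a quick sanity check of the total variation term, the sketch below mirrors `total_variation_loss` in plain NumPy and compares a constant image with a noisy one. The function name `total_variation_np` and the test arrays are assumptions made for illustration only.

```python
import numpy as np

def total_variation_np(img):
    """NumPy mirror of total_variation_loss for a (1, H, W, C) array."""
    # Squared differences between neighbouring rows and neighbouring columns.
    row_diff = np.square(img[:, :-1, :-1, :] - img[:, 1:, :-1, :])
    col_diff = np.square(img[:, :-1, :-1, :] - img[:, :-1, 1:, :])
    return float(np.mean(np.power(row_diff + col_diff, 1.25)))

rng = np.random.default_rng(0)
flat = np.full((1, 16, 16, 3), 0.5)        # perfectly smooth image
noisy = rng.uniform(size=(1, 16, 16, 3))   # pixel-level noise

tv_flat = total_variation_np(flat)
tv_noisy = total_variation_np(noisy)
print(tv_flat, tv_noisy)
```

The constant image gives a loss of zero, while the noisy one is penalized, which is the behaviour the regularizer relies on.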

Next, we create a feature extraction model that retrieves the intermediate activations
of VGG19 (as a dict, by name).

In [None]:
# Build a VGG19 model loaded with pre-trained ImageNet weights
vgg19_model = vgg19.VGG19(weights="imagenet", include_top=False)

# Get the symbolic outputs of each "key" layer (we gave them unique names).
outputs_dict = dict([(layer.name, layer.output) for layer in vgg19_model.layers])

# Set up a model that returns the activation values for every layer in
# VGG19 (as a dict).
feature_extractor = keras.Model(inputs=vgg19_model.inputs, outputs=outputs_dict)


Finally, the code that computes the style transfer loss:

In [None]:
def compute_loss(combination_image, base_image, style_reference_image):
    # List of layers to use for the style loss.
    style_layer_names = {
        "block1_conv1": 1.,
        "block2_conv1": 0.75,
        "block3_conv1": 0.3,
        "block4_conv1": 0.2,
        "block5_conv1": 0.2,
    }
    # The layer to use for the content loss.
    content_layer_name = "block5_conv2"
    # Weights of the different loss components
    total_variation_weight = 1e-1
    style_weight = 1e-2
    content_weight = 2.5e-2
    input_tensor = tf.concat(
        [base_image, style_reference_image, combination_image], axis=0
    )
    features = feature_extractor(input_tensor)
    # Initialize the loss
    loss = tf.zeros(shape=())
    # Add content loss
    layer_features = features[content_layer_name]
    base_image_features = layer_features[0, :, :, :]
    combination_features = layer_features[2, :, :, :]
    loss = loss + content_weight * content_loss(
        base_image_features, combination_features
    )
    # Add style loss
    for layer_name, weight in style_layer_names.items():
        layer_features = features[layer_name]
        style_reference_features = layer_features[1, :, :, :]
        combination_features = layer_features[2, :, :, :]
        sl = weight * style_loss(style_reference_features, combination_features)
        loss += (style_weight / len(style_layer_names)) * sl
    # Add total variation loss
    loss += total_variation_weight * total_variation_loss(combination_image)
    return loss


# Add a tf.function decorator to loss & gradient computation
# to compile it, and thus make it fast.
@tf.function
def compute_loss_and_grads(generated_image, base_image, style_reference_image):
    with tf.GradientTape() as tape:
        loss = compute_loss(generated_image, base_image, style_reference_image)
    grads = tape.gradient(loss, generated_image)
    return loss, grads

## The training loop

Repeatedly run vanilla gradient descent steps to minimize the loss, and save the
resulting image every 50 iterations.

We decay the learning rate by 0.96 every 100 steps.
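
The decay rule can be written out by hand. The sketch below assumes the default, non-staircase form of exponential decay, `initial * decay_rate ** (step / decay_steps)`, with the values used in this notebook; `decayed_learning_rate` is a hypothetical helper, not a Keras function.

```python
def decayed_learning_rate(step, initial=100.0, decay_steps=100, decay_rate=0.96):
    """Continuous exponential decay: initial * decay_rate ** (step / decay_steps)."""
    return initial * decay_rate ** (step / decay_steps)

# Learning rate at the start, after 100 steps, and at the end of a 300-step run.
for step in (0, 100, 300):
    print(step, round(decayed_learning_rate(step), 4))
```

Over the 300 iterations used below, the learning rate therefore drops by only about 12%.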


In [None]:
def generate_image(base_image_path, style_reference_image_path, result_prefix, iterations):
    base_image = preprocess_image(base_image_path)
    style_reference_image = preprocess_image(style_reference_image_path)
    generated_image = tf.Variable(preprocess_image(base_image_path))
    optimizer = keras.optimizers.SGD(
        keras.optimizers.schedules.ExponentialDecay(
            initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
        )
    )
    for i in range(1, iterations + 1):
        loss, grads = compute_loss_and_grads(
            generated_image, base_image, style_reference_image
        )
        optimizer.apply_gradients([(grads, generated_image)])
        if i % 50 == 0:
            print("Iteration %d: loss=%.2f" % (i, loss))
            img = deprocess_image(generated_image)
            fname = result_prefix + "_at_iteration_%d.jpg" % i
            keras.preprocessing.image.save_img(fname, img)
    return deprocess_image(generated_image)

In [None]:
def batch_generation(path_content, path_style, n_images, iterations, is_random_base=True, is_random_style=True, figsize=(16, 16)):
    plt.figure(figsize=figsize)    
    w = 3 * int(n_images ** .5)
    h = math.ceil(3 * n_images / w)
    all_content_names = os.listdir(path_content)
    all_style_names = os.listdir(path_style)
    if is_random_style:
        style_image_names = random.choices(all_style_names, k=n_images)
    else:
        style_image_names = all_style_names[:n_images]
    if is_random_base:
        content_image_names = random.choices(all_content_names, k=n_images)
    else:
        content_image_names = all_content_names[:n_images]
    for ind, (content_image_name, style_image_name) in enumerate(zip(content_image_names, style_image_names)):
        base_image_path = os.path.join(path_content, content_image_name)
        style_reference_image_path = os.path.join(path_style, style_image_name)
        result_prefix = os.path.join("../gatys_generated",content_image_name)
        img = generate_image(base_image_path, style_reference_image_path, result_prefix, iterations)
        plt.subplot(h, w, 3*ind + 1)
        base_img = keras.preprocessing.image.load_img(
            base_image_path, target_size=(IMAGE_SIZE, IMAGE_SIZE)
        )
        plt.imshow(base_img)
        plt.axis("off")
        plt.subplot(h, w, 3*ind + 2)
        style_img = keras.preprocessing.image.load_img(
            style_reference_image_path, target_size=(IMAGE_SIZE, IMAGE_SIZE)
        )
        plt.imshow(style_img)
        plt.axis("off")
        plt.subplot(h, w, 3*ind + 3)
        plt.imshow(img)
        plt.axis("off")
    plt.show()

In [None]:
%%time
random.seed(11)
batch_generation(PHOTO_PATH_JPG, STYLES_FOR_NST, 3, 300, is_random_base=True, is_random_style=True)

In [None]:
%%time
random.seed(11)
batch_generation(PHOTO_PATH_JPG, MONET_PATH_JPG, 3, 300, is_random_base=True, is_random_style=False)

## Conclusion

This is quite a powerful approach that can produce nice-looking results, but it requires some fine-tuning of the weights.

The main disadvantage for this competition, however, is that it is very slow (~10 sec. per image for only 300 iterations). It would require hours to create 7k images for submission. So, we shall explore other methods.
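
Using the rough figures above (both are approximations, not measurements), the total generation time can be estimated as follows:

```python
seconds_per_image = 10   # rough figure quoted above for 300 iterations
n_images = 7_000         # approximate number of images needed for a submission

total_hours = seconds_per_image * n_images / 3600
print(f"{total_hours:.1f} hours")
```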

------------------------------------------------
# 2. Adaptive Instance Normalization
Adapted from the Keras tutorial: [Neural Style Transfer with AdaIN](https://keras.io/examples/generative/adain/)

Follow-up papers that introduced
[Batch Normalization](https://arxiv.org/abs/1502.03167),
[Instance Normalization](https://arxiv.org/abs/1701.02096) and
[Conditional Instance Normalization](https://arxiv.org/abs/1610.07629)
allowed Style Transfer to be performed in new ways, no longer
requiring a slow iterative process.
Following these papers, the authors Xun Huang and Serge Belongie proposed
[Adaptive Instance Normalization](https://arxiv.org/abs/1703.06868) (AdaIN),
which allows arbitrary style transfer in real time.

You can also try out the AdaIN model with your own images using this
[Hugging Face demo](https://huggingface.co/spaces/keras-io/AdaIN).

## Architecture

The style transfer network takes a content image and a style image as
inputs and outputs the style transferred image. The authors of AdaIN
propose a simple encoder-decoder structure for achieving this.

![AdaIN architecture](https://i.imgur.com/JbIfoyE.png)

The content image ($C$) and the style image ($S$) are both fed to the
encoder network. The outputs of the encoder (feature maps)
are then fed to the $AdaIN$ layer, which computes a combined
feature map. This feature map is then fed into a randomly initialized
decoder network that serves as the generator for the neural style
transferred image.

$E_s = f(S), E_c = f(C)$

$t = AdaIN(E_c, E_s)$

$NST = Generator(t)$

The style feature map ($E_s$) and the content feature map ($E_c$) are
fed to the $AdaIN$ layer. This layer produces the combined feature map ($t$).

### Encoder

The encoder is a part of the VGG19 model pretrained on [ImageNet](https://www.image-net.org/).
We slice the model at the `block4_conv1` layer, the output layer suggested
by the authors in their paper.

In [None]:
def get_encoder():
    vgg19 = keras.applications.VGG19(
        include_top=False,
        weights="imagenet",
        input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
    )
    vgg19.trainable = False
    mini_vgg19 = keras.Model(vgg19.input, vgg19.get_layer("block4_conv1").output)
    inputs = keras.layers.Input([IMAGE_SIZE, IMAGE_SIZE, 3])
    mini_vgg19_out = mini_vgg19(inputs)
    return keras.Model(inputs, mini_vgg19_out, name="mini_vgg19")

### Adaptive Instance Normalization

The $AdaIN$ layer takes in the features of the content and style image.
The layer can be defined via the following equation:

$AdaIN(x, y) = \sigma (y) \left( \frac{x - \mu (x)}{\sigma (x)} \right) + \mu (y)$

where $\sigma$ is the standard deviation and $\mu$ is the mean of the
concerned variable. In the above equation, the mean and variance of the
content feature map $E_c$ are aligned with the mean and variance of the
style feature map $E_s$.

It is important to note that the $AdaIN$ layer proposed by the authors
uses no other parameters apart from mean and variance. The layer also
does not have any trainable parameters. This is why we use a
*Python function* instead of using a *Keras layer*. The function takes
style and content feature maps, computes the mean and standard deviation
of the images and returns the adaptive instance normalized feature map.

In [None]:
def get_mean_std(x, epsilon=1e-5):
    axes = [1, 2]
    # Compute the mean and standard deviation of a tensor.
    mean, variance = tf.nn.moments(x, axes=axes, keepdims=True)
    standard_deviation = tf.sqrt(variance + epsilon)
    return mean, standard_deviation


def ada_in(style, content):
    content_mean, content_std = get_mean_std(content)
    style_mean, style_std = get_mean_std(style)
    t = style_std * (content - content_mean) / content_std + style_mean
    return t
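
A small NumPy check illustrates what the layer does: after AdaIN, the per-channel statistics of the output should match those of the style features. `mean_std_np` and `ada_in_np` are hypothetical NumPy counterparts of the two functions above, and the random tensors stand in for encoder outputs.

```python
import numpy as np

def mean_std_np(x, epsilon=1e-5):
    """Per-sample, per-channel mean and std over the spatial axes of (N, H, W, C)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + epsilon)
    return mean, std

def ada_in_np(style, content):
    """Normalize content features, then rescale and shift with style statistics."""
    c_mean, c_std = mean_std_np(content)
    s_mean, s_std = mean_std_np(style)
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(2, 8, 8, 4))  # fake content features
style = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 4))    # fake style features

t = ada_in_np(style, content)
t_mean, t_std = mean_std_np(t)
s_mean, s_std = mean_std_np(style)
print(np.allclose(t_mean, s_mean, atol=1e-6), np.allclose(t_std, s_std, atol=1e-3))
```

The standard deviations agree only approximately because of the `epsilon` added for numerical stability.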

### Decoder

The authors specify that the decoder network must mirror the encoder
network.  We have symmetrically inverted the encoder to build our
decoder. We have used `UpSampling2D` layers to increase the spatial
resolution of the feature maps.

Note that the authors warn against using any normalization layer
in the decoder network, and do indeed go on to show that including
batch normalization or instance normalization hurts the performance
of the overall network.

This is the only portion of the entire architecture that is trainable.

In [None]:
def get_decoder():
    config = {"kernel_size": 3, "strides": 1, "padding": "same", "activation": "relu"}
    decoder = keras.Sequential(
        [
            keras.layers.InputLayer((None, None, 512)),
            keras.layers.Conv2D(filters=512, **config),
            # Default: size=(2, 2), data_format="channels_last", interpolation="nearest"
            keras.layers.UpSampling2D(),
            keras.layers.Conv2D(filters=256, **config),
            keras.layers.Conv2D(filters=256, **config),
            keras.layers.Conv2D(filters=256, **config),
            keras.layers.Conv2D(filters=256, **config),
            keras.layers.UpSampling2D(),
            keras.layers.Conv2D(filters=128, **config),
            keras.layers.Conv2D(filters=128, **config),
            keras.layers.UpSampling2D(),
            keras.layers.Conv2D(filters=64, **config),
            keras.layers.Conv2D(
                filters=3,
                kernel_size=3,
                strides=1,
                padding="same",
                activation="linear"#"sigmoid",
            ),
        ]
    )
    return decoder
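
The mirroring can be verified with simple shape arithmetic. This sketch assumes the standard VGG19 layout, in which three 2x2 max-pooling layers precede `block4_conv1`, and the three `UpSampling2D` layers of the decoder above, each doubling height and width.

```python
IMAGE_SIZE = 256

# Encoder: three max-pooling layers halve the spatial size three times.
encoder_size = IMAGE_SIZE // 2 ** 3

# Decoder: three UpSampling2D layers double the spatial size three times.
decoder_size = encoder_size * 2 ** 3

print(encoder_size, decoder_size)
```

The decoder therefore restores the feature map to the original image resolution.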

### Loss functions

Here we build the loss functions for the neural style transfer model.
The authors propose to use a pretrained VGG-19 to compute the loss
function of the network. It is important to keep in mind that this
will be used for training only the decoder network. The total
loss ($L_t$) is a weighted combination of content loss ($L_c$) and style
loss ($L_s$). The $w_s$ term is used to vary the amount of style
transferred.

$L_t = L_c + w_s L_s$

#### Content Loss

This is the Euclidean distance between the content image features
and the features of the neural style transferred image.

$L_c = ||f(NST) - t||_2$

Here the authors propose to use the output of the $AdaIN$ layer, $t$, as
the content target rather than the features of the original image.
This is done to speed up convergence.

#### Style Loss

Rather than using the more commonly used `Gram Matrix`,
the authors propose to compute the difference between the statistical features
(mean and variance) which makes it conceptually cleaner. This can be
expressed via the following equation:

$L_s = \sum_{i=1}^{l}{\left(||\mu(\phi_i(NST)) - \mu(\phi_i(S))||_2 + ||\sigma(\phi_i(NST)) - \sigma(\phi_i(S))||_2 \right)}$

where $\phi_i$ denotes the layers in VGG-19 used to compute the loss.
In this case this corresponds to:

- `block1_conv1`
- `block2_conv1`
- `block3_conv1`
- `block4_conv1`
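
The statistics-based style loss can be sketched in NumPy as follows. Mean squared error is used here in place of the norm, which is an assumption, and `mean_std_np`, `mse` and `style_stats_loss` are hypothetical helpers. The random arrays stand in for the feature maps of two layers.

```python
import numpy as np

def mean_std_np(x, epsilon=1e-5):
    """Per-sample, per-channel mean and std over the spatial axes of (N, H, W, C)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + epsilon)
    return mean, std

def mse(a, b):
    return float(np.mean(np.square(a - b)))

def style_stats_loss(style_feats, generated_feats):
    """Sum over layers of the mismatch in channel means and standard deviations."""
    loss = 0.0
    for s, g in zip(style_feats, generated_feats):
        s_mean, s_std = mean_std_np(s)
        g_mean, g_std = mean_std_np(g)
        loss += mse(s_mean, g_mean) + mse(s_std, g_std)
    return loss

rng = np.random.default_rng(0)
# Two fake "layers" with different spatial sizes and channel counts.
style_feats = [rng.normal(size=(1, 16, 16, 8)), rng.normal(size=(1, 8, 8, 16))]
other_feats = [rng.normal(loc=1.0, size=f.shape) for f in style_feats]

loss_same = style_stats_loss(style_feats, style_feats)
loss_diff = style_stats_loss(style_feats, other_feats)
print(loss_same, loss_diff)
```

Identical features give a loss of zero, and shifting the mean of the second set makes the loss positive.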

In [None]:
def get_loss_net():
    vgg19 = keras.applications.VGG19(
        include_top=False, weights="imagenet", input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)
    )
    vgg19.trainable = False
    layer_names = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]
    outputs = [vgg19.get_layer(name).output for name in layer_names]
    mini_vgg19 = keras.Model(vgg19.input, outputs)
    inputs = keras.layers.Input([IMAGE_SIZE, IMAGE_SIZE, 3])
    mini_vgg19_out = mini_vgg19(inputs)
    return keras.Model(inputs, mini_vgg19_out, name="loss_net")

## NST Model

We wrap the encoder and decoder inside of a `tf.keras.Model` subclass.
This allows us to customize what happens in the `model.fit()` loop.

In [None]:
class AdaInNSTModel(tf.keras.Model):
    def __init__(self, encoder, decoder, loss_net, style_weight, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.loss_net = loss_net
        self.style_weight = style_weight

    def compile(self, optimizer, loss_fn):
        super().compile()
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.style_loss_tracker = keras.metrics.Mean(name="style_loss")
        self.content_loss_tracker = keras.metrics.Mean(name="content_loss")
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

    def train_step(self, inputs):
        style, content = inputs
        # Initialize the content and style loss.
        loss_content = 0.0
        loss_style = 0.0
        with tf.GradientTape() as tape:
            # Encode the style and content image.
            style_encoded = self.encoder(style)
            content_encoded = self.encoder(content)
            # Compute the AdaIN target feature maps.
            t = ada_in(style=style_encoded, content=content_encoded)
            # Generate the neural style transferred image.
            reconstructed_image = self.decoder(t)
            # Compute the losses.
            reconstructed_vgg_features = self.loss_net(reconstructed_image)
            style_vgg_features = self.loss_net(style)
            loss_content = self.loss_fn(t, reconstructed_vgg_features[-1])
            for inp, out in zip(style_vgg_features, reconstructed_vgg_features):
                mean_inp, std_inp = get_mean_std(inp)
                mean_out, std_out = get_mean_std(out)
                loss_style += self.loss_fn(mean_inp, mean_out) + self.loss_fn(std_inp, std_out)
            loss_style = self.style_weight * loss_style
            total_loss = loss_content + loss_style
        # Compute gradients and optimize the decoder.
        trainable_vars = self.decoder.trainable_variables
        gradients = tape.gradient(total_loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update the trackers.
        self.style_loss_tracker.update_state(loss_style)
        self.content_loss_tracker.update_state(loss_content)
        self.total_loss_tracker.update_state(total_loss)
        return {
            "style_loss": self.style_loss_tracker.result(),
            "content_loss": self.content_loss_tracker.result(),
            "total_loss": self.total_loss_tracker.result(),
        }

    def test_step(self, inputs):
        style, content = inputs
        # Initialize the content and style loss.
        loss_content = 0.0
        loss_style = 0.0
        # Encode the style and content image.
        style_encoded = self.encoder(style)
        content_encoded = self.encoder(content)
        # Compute the AdaIN target feature maps.
        t = ada_in(style=style_encoded, content=content_encoded)
        # Generate the neural style transferred image.
        reconstructed_image = self.decoder(t)
        # Compute the losses.
        recons_vgg_features = self.loss_net(reconstructed_image)
        style_vgg_features = self.loss_net(style)
        loss_content = self.loss_fn(t, recons_vgg_features[-1])
        for inp, out in zip(style_vgg_features, recons_vgg_features):
            mean_inp, std_inp = get_mean_std(inp)
            mean_out, std_out = get_mean_std(out)
            loss_style += self.loss_fn(mean_inp, mean_out) + self.loss_fn(std_inp, std_out)
        loss_style = self.style_weight * loss_style
        total_loss = loss_content + loss_style
        # Update the trackers.
        self.style_loss_tracker.update_state(loss_style)
        self.content_loss_tracker.update_state(loss_content)
        self.total_loss_tracker.update_state(total_loss)
        return {
            "style_loss": self.style_loss_tracker.result(),
            "content_loss": self.content_loss_tracker.result(),
            "total_loss": self.total_loss_tracker.result(),
        }

    @property
    def metrics(self):
        return [
            self.style_loss_tracker,
            self.content_loss_tracker,
            self.total_loss_tracker,
        ]

## Data and tf.data pipeline
To show the full power of the approach, we train the model not only on the Monet paintings but also on a variety of other artworks. To do this, we use a great Kaggle dataset, [best-artworks-of-all-time](https://www.kaggle.com/datasets/ikarus777/best-artworks-of-all-time).
In this section, we decode, convert and resize the images from the folder.
After we have our style and content data pipeline ready, we zip the two together to obtain the data pipeline that our model will consume.
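
The repeat, zip and batch pattern can be illustrated with a plain-Python analogy. This is not `tf.data` code: `batched_pairs` is a hypothetical helper, the item names are placeholders, and shuffling and prefetching are left out. The point is that repeating both sources lets the shorter one wrap around, so every batch contains full (style, content) pairs.

```python
from itertools import cycle, islice

def batched_pairs(style_items, content_items, batch_size):
    """Yield batches of (style, content) pairs from two endlessly repeated sequences."""
    pairs = zip(cycle(style_items), cycle(content_items))
    while True:
        yield list(islice(pairs, batch_size))

styles = ["s0", "s1", "s2"]                 # placeholder style items
contents = ["c0", "c1", "c2", "c3", "c4"]   # placeholder content items

first_batch = next(batched_pairs(styles, contents, batch_size=4))
print(first_batch)
```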

In [None]:
def decode_and_resize(image_path):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    #image = tf.image.convert_image_dtype(image, dtype="float32")
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    # vgg19.preprocess_input will convert the input images from RGB to BGR,
    # then will zero-center each color channel with respect to the ImageNet dataset, without scaling
    image = vgg19.preprocess_input(image)
    return image


# Get the image file paths
monet_style_images = [os.path.join(MONET_PATH_JPG, file_name) for file_name in os.listdir(MONET_PATH_JPG)]
style_images = monet_style_images + [
    os.path.join(BEST_ARTS_FOR_NST, file_name) for file_name in os.listdir(BEST_ARTS_FOR_NST)]
# Build the style and content tf.data datasets.
style_ds = (tf.data.Dataset.from_tensor_slices(style_images)
            .map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE).repeat())
content_ds = (keras.utils.image_dataset_from_directory(PHOTO_PATH_JPG, labels=None).unbatch()
              .map(vgg19.preprocess_input, num_parallel_calls=tf.data.AUTOTUNE).repeat())
# Zipping the style and content datasets.
BATCH_SIZE = 32
train_ds = (
    tf.data.Dataset.zip((style_ds, content_ds))
    .shuffle(BATCH_SIZE * 2)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

#### Visualizing the data
It is always better to visualize the data before training. To ensure the correctness of our preprocessing pipeline, we visualize 5 samples from our dataset.

In [None]:
def visualize_ds(ds, deprocess=True):
    style, content = next(iter(ds))
    num_to_vis = 5
    fig, axes = plt.subplots(nrows=2, ncols=num_to_vis, figsize=(15, 5))
    [ax.axis("off") for ax in np.ravel(axes)]
    for (axis, style_image, content_image) in zip(axes.T, style[0:num_to_vis], content[0:num_to_vis]):
        (ax_style, ax_content) = axis
        if deprocess:
            ax_style.imshow(deprocess_image(style_image))
        else:
            ax_style.imshow(style_image.numpy().reshape((IMAGE_SIZE, IMAGE_SIZE, 3)) / 2. + 0.5)
        ax_style.set_title("Style Image")
        if deprocess:
            ax_content.imshow(deprocess_image(content_image))
        else:
            ax_content.imshow(content_image.numpy().reshape((IMAGE_SIZE, IMAGE_SIZE, 3)) / 2. + 0.5)
        ax_content.set_title("Content Image")
        
visualize_ds(train_ds)

## Training Monitor callback

This callback is used to visualize the style transfer output of
the model at the end of each epoch. The objective of style transfer cannot be
quantified properly, and is to be subjectively evaluated by an audience.
For this reason, visualization is a key aspect of evaluating the model.

In [None]:
class TrainMonitor(keras.callbacks.Callback):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.test_style, self.test_content = next(iter(train_ds))

    def on_epoch_end(self, epoch, logs=None):
        epoch += 1
        if epoch == 1 or epoch % 10 == 0:
            # Encode the style and content image.
            test_style_encoded = self.model.encoder(self.test_style)
            test_content_encoded = self.model.encoder(self.test_content)
            # Compute the AdaIN features.
            test_t = ada_in(style=test_style_encoded, content=test_content_encoded)
            test_reconstructed_image = self.model.decoder(test_t)
            # Plot the Style, Content and the NST image.
            fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
            [ax.axis("off") for ax in np.ravel(axs)]
            axs[0].imshow(deprocess_image(self.test_style[0]))
            axs[0].set_title(f"Style: {epoch:03d}")
            axs[1].imshow(deprocess_image(self.test_content[0]))
            axs[1].set_title(f"Content: {epoch:03d}")
            axs[2].imshow(deprocess_image(test_reconstructed_image[0]))
            axs[2].set_title(f"NST: {epoch:03d}")
            plt.show()
            plt.close()
            keras.models.save_model(self.model.decoder, "adain_decoder_checkpoint.h5")


## Training the model

In this section, we define the optimizer, the loss function, and the
trainer module. We compile the trainer module with the optimizer and
the loss function and then train it.

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=1e-5)
loss_fn = keras.losses.MeanSquaredError()
encoder = get_encoder()
loss_net = get_loss_net()
decoder = get_decoder()
adain_model = AdaInNSTModel(encoder=encoder, decoder=decoder, loss_net=loss_net, style_weight=4.0)
adain_model.compile(optimizer=optimizer, loss_fn=loss_fn)
train_monitor_cb = TrainMonitor()
EPOCHS = 1
history = adain_model.fit(train_ds, epochs=EPOCHS, steps_per_epoch=50, callbacks=[train_monitor_cb])

In [None]:
keras.models.save_model(decoder, "adain_decoder_final.h5")

In [None]:
decoder = keras.models.load_model("/kaggle/input/different-approaches-to-style-transfer-v0/adain_decoder_final.h5")
adain_model = AdaInNSTModel(encoder=encoder, decoder=decoder, loss_net=loss_net, style_weight=4.0)
adain_model.compile(optimizer=optimizer, loss_fn=loss_fn)

## Inference

Now we can run inference with the trained model.
We will pass arbitrary content and style images from the dataset and take a look at the output images.

In [None]:
for style, content in train_ds.take(1):
    style_encoded = adain_model.encoder(style)
    content_encoded = adain_model.encoder(content)
    t = ada_in(style=style_encoded, content=content_encoded)
    reconstructed_image = adain_model.decoder(t)
    fig, axes = plt.subplots(nrows=10, ncols=3, figsize=(10, 30))
    [ax.axis("off") for ax in np.ravel(axes)]
    for axis, style_image, content_image, reconstructed_image in zip(
        axes, style[0:10], content[0:10], reconstructed_image[0:10]
    ):
        (ax_style, ax_content, ax_reconstructed) = axis
        ax_style.imshow(deprocess_image(style_image))
        ax_style.set_title("Style Image")
        ax_content.imshow(deprocess_image(content_image))
        ax_content.set_title("Content Image")
        ax_reconstructed.imshow(deprocess_image(reconstructed_image))
        ax_reconstructed.set_title("NST Image")

## Create submission file

In [None]:
import shutil
! mkdir ../adain_images

In [None]:
def save_adain_submission():
    monet_ds = (tf.data.Dataset.from_tensor_slices(monet_style_images)
                .map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE).repeat())
    photo_ds = (keras.utils.image_dataset_from_directory(PHOTO_PATH_JPG, labels=None).unbatch()
                  .map(vgg19.preprocess_input, num_parallel_calls=tf.data.AUTOTUNE))
    test_ds = (
        tf.data.Dataset.zip((monet_ds, photo_ds))
        .shuffle(BATCH_SIZE * 2)
        .batch(BATCH_SIZE)
        .prefetch(tf.data.AUTOTUNE)
    )
    i = 1
    for style, content in test_ds:
        style_encoded = adain_model.encoder(style)
        content_encoded = adain_model.encoder(content)
        t = ada_in(style=style_encoded, content=content_encoded)
        reconstructed_images = adain_model.decoder(t)
        for reconstructed_image in reconstructed_images:
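            prediction = deprocess_image(reconstructed_image).astype(np.uint8)
            # cv2.imwrite expects BGR; this assumes deprocess_image returns RGB (as plt.imshow requires)
            cv2.imwrite("../adain_images/" + str(i) + ".jpg", cv2.cvtColor(prediction, cv2.COLOR_RGB2BGR))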
            i += 1
            if i%500 == 0:
                print(i)
    shutil.make_archive("/kaggle/working/adain_images", 'zip', "/kaggle/adain_images")
    
save_adain_submission()

## Conclusion

Adaptive Instance Normalization allows arbitrary style transfer in
real time, and it takes just a few minutes to generate the 7k required images.
It is also important to note that the novel proposition of
the authors is to achieve this only by aligning the statistical
features (mean and standard deviation) of the style and the content
images.
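
To illustrate what "aligning the statistical features" means, here is a minimal NumPy sketch of the AdaIN operation on feature maps of shape `(batch, height, width, channels)`. The function name `ada_in_np` and the random feature maps are assumptions made for this example; the notebook itself uses the TensorFlow function `ada_in` on VGG-19 encoder outputs.

```python
import numpy as np


def ada_in_np(style, content, epsilon=1e-5):
    """Give the content features the per-channel mean and std of the style features."""
    axes = (1, 2)  # statistics are computed over the spatial dimensions
    content_mean = content.mean(axis=axes, keepdims=True)
    content_std = np.sqrt(content.var(axis=axes, keepdims=True) + epsilon)
    style_mean = style.mean(axis=axes, keepdims=True)
    style_std = np.sqrt(style.var(axis=axes, keepdims=True) + epsilon)
    # Normalize the content features, then rescale and shift with the style statistics.
    return style_std * (content - content_mean) / content_std + style_mean


rng = np.random.default_rng(0)
content_feat = rng.normal(loc=0.0, scale=1.0, size=(2, 8, 8, 4))
style_feat = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 4))
stylized = ada_in_np(style=style_feat, content=content_feat)
```

After the operation, each channel of `stylized` has (up to `epsilon`) the mean and standard deviation of the corresponding style channel, while the spatial layout still comes from the content features.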

Possible improvements could include finding a pretrained model and fine-tuning it only for Monet images.

---------------------------------------
# 3. CycleGANs

This section is based on [two](https://towardsdatascience.com/cyclegan-learning-to-translate-images-without-paired-training-data-5b4e93862c8d) nice [blogs about CycleGANs](https://hardikbansal.github.io/CycleGANBlog/) and a [notebook](https://www.kaggle.com/code/dimitreoliveira/improving-cyclegan-monet-paintings/notebook) with a great collection of links for possible improvements. Most of the figures, explanations and code were adapted from these sources.

Apart from those, there are a couple more useful resources:
* [Nice and brief explanation of CycleGANs](https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/assignments/a4-handout.pdf)
* [CycleGAN project website](https://junyanz.github.io/CycleGAN/)
* [Paper](https://arxiv.org/pdf/1703.10593.pdf)
* [Original implementation](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)

CycleGAN is a model that aims to solve the image-to-image translation problem. The goal of the image-to-image translation problem is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, obtaining paired examples isn't always feasible. CycleGAN tries to learn this mapping without requiring paired input-output images, using cycle-consistency: if we transform from source distribution to target and then back again to source distribution, we should get samples from our source distribution.
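
The cycle-consistency idea can be illustrated with a toy example in which simple functions stand in for the two generators. Everything here (`g_ab`, `g_ba_exact`, `g_ba_biased`, `cycle_l1`) is hypothetical and only meant to show the principle: a perfect round trip gives zero loss, an imperfect one is penalized.

```python
import numpy as np


def cycle_l1(x, forward, backward):
    """Mean absolute error between x and backward(forward(x))."""
    return float(np.mean(np.abs(x - backward(forward(x)))))


def g_ab(a):
    """Toy 'generator' translating domain A to domain B."""
    return 2.0 * a + 1.0


def g_ba_exact(b):
    """Exact inverse of g_ab, so the round trip recovers the input."""
    return (b - 1.0) / 2.0


def g_ba_biased(b):
    """Imperfect inverse: the round trip is shifted by 0.5."""
    return b / 2.0


a = np.array([0.0, 0.5, 1.0])
loss_exact = cycle_l1(a, g_ab, g_ba_exact)    # round trip recovers a
loss_biased = cycle_l1(a, g_ab, g_ba_biased)  # round trip is off by 0.5
```

In CycleGAN the same penalty is applied in both directions, which pushes the two generators towards being approximate inverses of each other.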

## Architecture

CycleGAN is a Generative Adversarial Network (GAN) that uses two generators and two discriminators.
We call one generator $G_{A\rightarrow B}$. It converts images from the $A$ domain to the $B$ domain. The other generator is called $G_{B\rightarrow A}$, and converts images from $B$ to $A$.

Each generator has a corresponding discriminator, which attempts to tell apart its synthesized images from real ones.

![CycleGAN architecture illustration forward](https://hardikbansal.github.io/CycleGANBlog/images/model.jpg)

------------------------------------------------

![CycleGAN architecture illustration reverse](https://hardikbansal.github.io/CycleGANBlog/images/model1.jpg)

### Generator
The generator in the CycleGAN has layers that implement three stages of computation: 1) the first
stage encodes the input via a series of convolutional layers that extract the image features; 2) the
second stage then transforms the features by passing them through one or more residual blocks;
and 3) the third stage decodes the transformed features using a series of transpose convolutional
layers, to build an output image of the same size as the input.

![Generator architecture](https://hardikbansal.github.io/CycleGANBlog/images/Generator.jpg)
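
The three stages can be followed by tracing the tensor shapes through the network. The sketch below is plain Python and assumes `padding='same'` everywhere, a 256x256 input, and the filter counts and strides of the generator built later in this notebook; the helper names are hypothetical.

```python
import math


def conv_same(size, stride):
    """Spatial size after a Conv2D with padding='same'."""
    return math.ceil(size / stride)


def deconv_same(size, stride):
    """Spatial size after a Conv2DTranspose with padding='same'."""
    return size * stride


def trace_generator(size=256, transformer_blocks=6):
    """Return (stage, spatial_size, channels) for the output of each stage."""
    shapes = []
    # Stage 1: the encoder extracts features and reduces the resolution.
    for name, filters, stride in [("enc_1", 64, 1), ("enc_2", 128, 2), ("enc_3", 256, 2)]:
        size = conv_same(size, stride)
        shapes.append((name, size, filters))
    channels = shapes[-1][2]
    # Stage 2: residual blocks keep both the resolution and the channel count.
    for n in range(transformer_blocks):
        shapes.append((f"res_{n + 1}", size, channels))
    # Stage 3: the decoder restores the input resolution.
    for name, filters, stride in [("dec_1", 128, 2), ("dec_2", 64, 2)]:
        size = deconv_same(size, stride)
        shapes.append((name, size, filters))
    shapes.append(("output", conv_same(size, 1), 3))
    return shapes


for stage, spatial, channels in trace_generator():
    print(f"{stage:>7}: {spatial} x {spatial} x {channels}")
```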

#### Encoder
Each block applies a convolutional filter while also reducing the data resolution and increasing the number of features.

In [None]:
import tensorflow_addons as tfa

conv_initializer = tf.random_normal_initializer(mean=0.0, stddev=0.02)
gamma_initializer = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    
def encode(input_layer, filters, size=3, strides=2, apply_instancenorm=True, activation=keras.layers.ReLU(), name='x'):
    block = keras.layers.Conv2D(filters, size, strides=strides, padding='same', 
                     use_bias=False, kernel_initializer=conv_initializer, 
                     name=f'encoder_{name}')(input_layer)
    if apply_instancenorm:
        block = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(block)
    block = activation(block)
    return block

#### Transformation
Each block applies two convolutional filters and adds a residual connection, to find relevant data patterns while keeping the dimensions constant.

In [None]:
def transform(input_layer, size=3, strides=1, name='x'):
    filters = input_layer.shape[-1]
    block = keras.layers.Conv2D(filters, size, strides=strides, padding='same', use_bias=False, 
                     kernel_initializer=conv_initializer, name=f'transformer_{name}_1')(input_layer)
    block = keras.layers.ReLU()(block)
    block = keras.layers.Conv2D(filters, size, strides=strides, padding='same', use_bias=False, 
                     kernel_initializer=conv_initializer, name=f'transformer_{name}_2')(block)
    block = keras.layers.Add()([block, input_layer])
    return block

#### Decoder
Each block applies a transposed convolution (deconvolution) filter, which increases the data resolution and decreases the number of features.

In [None]:
def decode(input_layer, filters, size=3, strides=2, apply_instancenorm=True, name='x'):
    block = keras.layers.Conv2DTranspose(filters, size,  strides=strides, padding='same', 
                              use_bias=False, kernel_initializer=conv_initializer, 
                              name=f'decoder_{name}')(input_layer)
    if apply_instancenorm:
        block = tfa.layers.InstanceNormalization(gamma_initializer=gamma_initializer)(block)
    block = keras.layers.ReLU()(block)
    return block

#### Generator
The architecture was proposed by the authors of the CycleGAN paper, and [it was shown](https://www.kaggle.com/code/dimitreoliveira/improving-cyclegan-monet-paintings/notebook) that skipping normalization in the first layer and adding skip connections between the encoder and the decoder improves performance.

In [None]:
def get_generator(height=IMAGE_SIZE, width=IMAGE_SIZE, channels=3, transformer_blocks=6):
    inputs = keras.layers.Input(shape=[height, width, channels], name='input_image')
    # Encoder
    enc_1 = encode(inputs, 64,  7, 1, apply_instancenorm=False, name='block_1') # (bs, 256, 256, 64)
    enc_2 = encode(enc_1, 128, 3, 2, apply_instancenorm=True, name='block_2')   # (bs, 128, 128, 128)
    enc_3 = encode(enc_2, 256, 3, 2, apply_instancenorm=True, name='block_3')   # (bs, 64, 64, 256)
    # Transformer
    x = enc_3
    for n in range(transformer_blocks):
        x = transform(x, 3, 1, name=f'block_{n+1}') # (bs, 64, 64, 256)
    # Decoder
    x_skip = keras.layers.Concatenate(name='enc_dec_skip_1')([x, enc_3]) # encoder - decoder skip connection
    dec_1 = decode(x_skip, 128, 3, 2, apply_instancenorm=True, name='block_1') # (bs, 128, 128, 128)
    x_skip = keras.layers.Concatenate(name='enc_dec_skip_2')([dec_1, enc_2]) # encoder - decoder skip connection
    dec_2 = decode(x_skip, 64,  3, 2, apply_instancenorm=True, name='block_2') # (bs, 256, 256, 64)
    x_skip = keras.layers.Concatenate(name='enc_dec_skip_3')([dec_2, enc_1]) # encoder - decoder skip connection
    outputs = last = keras.layers.Conv2D(channels, 7, strides=1, padding='same', 
                              kernel_initializer=conv_initializer, use_bias=False, 
                              activation='tanh', name='decoder_output_block')(x_skip) # (bs, 256, 256, 3)
    return keras.Model(inputs, outputs)

### Discriminator
The discriminators are fully convolutional neural networks that look at a “patch” of the input image, and output the probability of the patch being “real”. This is both more computationally efficient than trying to look at the entire input image, and is also more effective — it allows the discriminator to focus on more surface-level features, like texture, which is often the sort of thing being changed in an image translation task.

![Discriminator architecture](https://miro.medium.com/max/1400/1*46CddTc5JwkFW_pQb4nGZQ.png)

In [None]:
def get_discriminator(height=IMAGE_SIZE, width=IMAGE_SIZE, channels=3):
    inputs = keras.layers.Input(shape=[height, width, channels], name='input_image')
    x = encode(inputs, 64,  4, 2, apply_instancenorm=False, activation=keras.layers.LeakyReLU(0.2), name='block_1') # (bs, 128, 128, 64)
    x = encode(x, 128, 4, 2, apply_instancenorm=True, activation=keras.layers.LeakyReLU(0.2), name='block_2')       # (bs, 64, 64, 128)
    x = encode(x, 256, 4, 2, apply_instancenorm=True, activation=keras.layers.LeakyReLU(0.2), name='block_3')       # (bs, 32, 32, 256)
    x = encode(x, 512, 4, 1, apply_instancenorm=True, activation=keras.layers.LeakyReLU(0.2), name='block_4')       # (bs, 32, 32, 512)
    outputs = keras.layers.Conv2D(1, 4, strides=1, padding='valid', kernel_initializer=conv_initializer)(x)                # (bs, 29, 29, 1)
    discriminator = keras.Model(inputs, outputs)
    return discriminator
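
The shape comments in the discriminator code can be checked with a few lines of plain Python, which also compute how large a patch of the input each output value "sees" (its receptive field). This is a sketch under the assumption of a 256x256 input and the kernel sizes, strides and paddings listed in the code above; `LAYERS`, `patch_output_size` and `receptive_field` are hypothetical names.

```python
import math

# (kernel, stride, padding) for each convolution of the discriminator above.
LAYERS = [(4, 2, "same"), (4, 2, "same"), (4, 2, "same"), (4, 1, "same"), (4, 1, "valid")]


def patch_output_size(size, layers=LAYERS):
    """Spatial size of the discriminator output for a square input."""
    for kernel, stride, padding in layers:
        if padding == "same":
            size = math.ceil(size / stride)
        else:  # "valid": no padding, so the kernel must fit inside the input
            size = (size - kernel) // stride + 1
    return size


def receptive_field(layers=LAYERS):
    """Side length of the input patch that influences one output value."""
    field, jump = 1, 1
    for kernel, stride, _ in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride                # distance in input pixels between neighbouring outputs
    return field


print(patch_output_size(256), receptive_field())
```

Under these assumptions, each of the 29 x 29 output values judges a 70 x 70 patch of the input image rather than the whole image.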

## Objective Function
There are two main components to the CycleGAN objective function, an _adversarial loss_ and a _cycle consistency loss_. Both are essential to getting good results. In addition, an _identity loss_ term is added to the total loss of each generator.
We use the least squares loss here (found by [Mao et al.](https://arxiv.org/abs/1611.04076) to be more effective than the typical log likelihood loss).

__Adversarial loss__:

Discriminator:

$L^D_{real} = \frac{1}{2} \left(\frac{1}{n}\sum_{i=1}^{n}{(D_A(a_i) - 1)^2} + \frac{1}{n}\sum_{i=1}^{n}{(D_B(b_i) - 1)^2}\right)$

$L^D_{fake} = \frac{1}{2} \left(\frac{1}{n}\sum_{i=1}^{n}{D_B(G_{A\rightarrow B}(a_i))^2} + \frac{1}{n}\sum_{i=1}^{n}{D_A(G_{B\rightarrow A}(b_i))^2}\right)$

Generator:

$L^{G_{A\rightarrow B}}_{adv} = \frac{1}{n}\sum_{i=1}^{n}{(D_B(G_{A\rightarrow B}(a_i))-1)^2}$

$L^{G_{B\rightarrow A}}_{adv} = \frac{1}{n}\sum_{i=1}^{n}{(D_A(G_{B\rightarrow A}(b_i))-1)^2}$

__Cycle consistency loss__:

$L^{A\rightarrow B\rightarrow A}_{cycle} = \frac{1}{n}\sum_{i=1}^{n}{|a_i - G_{B\rightarrow A}(G_{A\rightarrow B}(a_i))|}$

$L^{B\rightarrow A\rightarrow B}_{cycle} = \frac{1}{n}\sum_{i=1}^{n}{|b_i - G_{A\rightarrow B}(G_{B\rightarrow A}(b_i))|}$

__Identity loss__:

$L^{A\rightarrow B}_{identity} = \frac{1}{n}\sum_{i=1}^{n}{|b_i - G_{A\rightarrow B}(b_i)|}$

$L^{B\rightarrow A}_{identity} = \frac{1}{n}\sum_{i=1}^{n}{|a_i - G_{B\rightarrow A}(a_i)|}$

__Total generator loss__:

$L^{G_{A\rightarrow B}}_{total} = L^{G_{A\rightarrow B}}_{adv} + \lambda \left(L^{A\rightarrow B\rightarrow A}_{cycle} + L^{B\rightarrow A\rightarrow B}_{cycle}\right) + \frac{1}{2} \lambda L^{A\rightarrow B}_{identity}$

$L^{G_{B\rightarrow A}}_{total} = L^{G_{B\rightarrow A}}_{adv} + \lambda \left(L^{B\rightarrow A\rightarrow B}_{cycle} + L^{A\rightarrow B\rightarrow A}_{cycle}\right) + \frac{1}{2} \lambda L^{B\rightarrow A}_{identity}$

In [None]:
adv_loss_fn = keras.losses.MeanSquaredError()
# Discriminator loss {0: fake, 1: real} (The discriminator loss outputs the average of the real and generated loss)
def discriminator_loss(real, generated):
    real_loss = adv_loss_fn(tf.ones_like(real), real)
    generated_loss = adv_loss_fn(tf.zeros_like(generated), generated)
    total_disc_loss = real_loss + generated_loss
    return total_disc_loss * 0.5

# Generator loss
def generator_loss(generated):
    return adv_loss_fn(tf.ones_like(generated), generated)


# Cycle consistency loss (measures how similar the original image and the twice-transformed image are)
def calc_cycle_loss(real_image, cycled_image, Lambda):
    loss1 = tf.reduce_mean(tf.abs(real_image - cycled_image))
    return Lambda * loss1

# Identity loss (compares an image with the output of the generator that targets its own domain, e.g. a photo passed through the photo generator)
def identity_loss(real_image, same_image, Lambda):
    loss = tf.reduce_mean(tf.abs(real_image - same_image))
    return Lambda * 0.5 * loss
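
As a quick sanity check of the formulas above, the losses can be evaluated by hand on a few made-up numbers. The NumPy sketch below mirrors the structure of the loss functions in this notebook; the discriminator scores and the cycle and identity values are hypothetical, and the cycle and identity terms are passed in unweighted (in the notebook code the factor $\lambda$ is applied inside `calc_cycle_loss` and `identity_loss`).

```python
import numpy as np


def mse(target, pred):
    return float(np.mean((target - pred) ** 2))


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores -> 1, fake scores -> 0."""
    real_loss = mse(np.ones_like(real_scores), real_scores)
    fake_loss = mse(np.zeros_like(fake_scores), fake_scores)
    return 0.5 * (real_loss + fake_loss)


def lsgan_generator_loss(fake_scores):
    """The generator wants the discriminator to score its fakes as 1."""
    return mse(np.ones_like(fake_scores), fake_scores)


def total_generator_loss(adv, cycle_ab, cycle_ba, identity, lam=10.0):
    """adv + lambda * (both cycle losses) + 0.5 * lambda * identity."""
    return adv + lam * (cycle_ab + cycle_ba) + 0.5 * lam * identity


real_scores = np.array([0.9, 0.8])  # discriminator outputs on real images (made up)
fake_scores = np.array([0.2, 0.1])  # discriminator outputs on generated images (made up)
d_loss = lsgan_discriminator_loss(real_scores, fake_scores)
g_adv = lsgan_generator_loss(fake_scores)
g_total = total_generator_loss(g_adv, cycle_ab=0.05, cycle_ba=0.03, identity=0.02)
```

With $\lambda = 10$, even small reconstruction errors contribute more to the total generator loss than the adversarial term, which is what keeps the content of the image intact.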

## CycleGAN Model
We will override the `train_step()` method of the Model class for training via `fit()`.

In [None]:
class CycleGan(keras.Model):
    def __init__(self, monet_generator, photo_generator, 
                 monet_discriminator, photo_discriminator, lambda_cycle=10):
        super(CycleGan, self).__init__()
        self.monet_generator = monet_generator
        self.photo_generator = photo_generator
        self.monet_discriminator = monet_discriminator
        self.photo_discriminator = photo_discriminator
        self.lambda_cycle = lambda_cycle
        
    def compile(self, monet_generator_optimizer, photo_generator_optimizer,
                monet_discriminator_optimizer, photo_discriminator_optimizer,
                gen_loss_fn, disc_loss_fn, cycle_loss_fn, identity_loss_fn):
        super(CycleGan, self).compile()
        self.monet_generator_optimizer = monet_generator_optimizer
        self.photo_generator_optimizer = photo_generator_optimizer
        self.monet_discriminator_optimizer = monet_discriminator_optimizer
        self.photo_discriminator_optimizer = photo_discriminator_optimizer
        self.gen_loss_fn = gen_loss_fn
        self.disc_loss_fn = disc_loss_fn
        self.cycle_loss_fn = cycle_loss_fn
        self.identity_loss_fn = identity_loss_fn
        
    def train_step(self, batch_data):
        real_monet, real_photo = batch_data
        with tf.GradientTape(persistent=True) as tape:
            # photo to monet back to photo
            fake_monet = self.monet_generator(real_photo, training=True)
            cycled_photo = self.photo_generator(fake_monet, training=True)
            # monet to photo back to monet
            fake_photo = self.photo_generator(real_monet, training=True)
            cycled_monet = self.monet_generator(fake_photo, training=True)
            # generating itself
            same_monet = self.monet_generator(real_monet, training=True)
            same_photo = self.photo_generator(real_photo, training=True)
            # discriminator used to check, inputing real images
            disc_real_monet = self.monet_discriminator(real_monet, training=True)
            disc_real_photo = self.photo_discriminator(real_photo, training=True)
            # discriminator used to check, inputing fake images
            disc_fake_monet = self.monet_discriminator(fake_monet, training=True)
            disc_fake_photo = self.photo_discriminator(fake_photo, training=True)
            # evaluate generator loss
            monet_gen_loss = self.gen_loss_fn(disc_fake_monet)
            photo_gen_loss = self.gen_loss_fn(disc_fake_photo)
            # evaluate total cycle consistency loss
            total_cycle_loss = self.cycle_loss_fn(real_monet, cycled_monet, self.lambda_cycle) + \
                self.cycle_loss_fn(real_photo, cycled_photo, self.lambda_cycle)
            # evaluate total generator loss
            total_monet_gen_loss = monet_gen_loss + total_cycle_loss + self.identity_loss_fn(real_monet, same_monet, self.lambda_cycle)
            total_photo_gen_loss = photo_gen_loss + total_cycle_loss + self.identity_loss_fn(real_photo, same_photo, self.lambda_cycle)
            # evaluate discriminator loss
            monet_disc_loss = self.disc_loss_fn(disc_real_monet, disc_fake_monet)
            photo_disc_loss = self.disc_loss_fn(disc_real_photo, disc_fake_photo)
        # Calculate the gradients for generator and discriminator
        monet_generator_gradients = tape.gradient(total_monet_gen_loss, 
                                                self.monet_generator.trainable_variables)
        photo_generator_gradients = tape.gradient(total_photo_gen_loss,
                                                  self.photo_generator.trainable_variables)
        monet_discriminator_gradients = tape.gradient(monet_disc_loss,
                                                      self.monet_discriminator.trainable_variables)
        photo_discriminator_gradients = tape.gradient(photo_disc_loss,
                                                      self.photo_discriminator.trainable_variables)
        # Apply the gradients to the optimizer
        self.monet_generator_optimizer.apply_gradients(zip(monet_generator_gradients,
                                                 self.monet_generator.trainable_variables))
        self.photo_generator_optimizer.apply_gradients(zip(photo_generator_gradients,
                                                 self.photo_generator.trainable_variables))
        self.monet_discriminator_optimizer.apply_gradients(zip(monet_discriminator_gradients,
                                                  self.monet_discriminator.trainable_variables))
        self.photo_discriminator_optimizer.apply_gradients(zip(photo_discriminator_gradients,
                                                  self.photo_discriminator.trainable_variables))
        return {'monet_gen_loss': total_monet_gen_loss,
                'photo_gen_loss': total_photo_gen_loss,
                'monet_disc_loss': monet_disc_loss,
                'photo_disc_loss': photo_disc_loss
               }

## Data and tf.data pipeline

In [None]:
def normalize_img(img):
    img = tf.cast(img, dtype=tf.float32)
    return (img / 127.5) - 1.0

BATCH_SIZE = 1
monet_ds = (keras.utils.image_dataset_from_directory(MONET_PATH_JPG, labels=None).unbatch()
            .map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE).repeat())
photo_ds = (keras.utils.image_dataset_from_directory(PHOTO_PATH_JPG, labels=None).unbatch()
            .map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE))
gan_ds = tf.data.Dataset.zip((monet_ds, photo_ds)).shuffle(BATCH_SIZE * 2).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

visualize_ds(gan_ds, deprocess=False)
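
The scaling used here maps pixel values from [0, 255] to [-1, 1], which matches the `tanh` output of the generators. A small NumPy sketch of the round trip is shown below; `normalize_np` and `denormalize_np` are hypothetical counterparts of `normalize_img` and of the re-scaling done before saving images. Rounding before the cast to `uint8` is a choice made in this sketch to guard against off-by-one errors from floating-point truncation.

```python
import numpy as np


def normalize_np(img):
    """uint8 pixels in [0, 255] -> float32 values in [-1, 1]."""
    return img.astype(np.float32) / 127.5 - 1.0


def denormalize_np(x):
    """float values in [-1, 1] -> uint8 pixels in [0, 255]."""
    return np.clip(np.rint(x * 127.5 + 127.5), 0, 255).astype(np.uint8)


pixels = np.arange(256, dtype=np.uint8)  # every possible pixel value
normalized = normalize_np(pixels)
restored = denormalize_np(normalized)
```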

## Training Monitor callback

In [None]:
def plot_cycle(example_sample, generator_a, generator_b, n_samples=1):
    fig, axes = plt.subplots(n_samples, 3, figsize=(22, (n_samples*6)))
    axes = axes.flatten()
    for n_sample in range(n_samples):
        idx = n_sample*3
        generated_a_sample = generator_a.predict(example_sample)
        generated_b_sample = generator_b.predict(generated_a_sample)
        axes[idx].set_title('Input image', fontsize=18)
        axes[idx].imshow(example_sample[0] * 0.5 + 0.5)
        axes[idx].axis('off')
        axes[idx+1].set_title('Generated image', fontsize=18)
        axes[idx+1].imshow(generated_a_sample[0] * 0.5 + 0.5)
        axes[idx+1].axis('off')
        axes[idx+2].set_title('Cycled image', fontsize=18)
        axes[idx+2].imshow(generated_b_sample[0] * 0.5 + 0.5)
        axes[idx+2].axis('off')
    plt.show()
    plt.close()

    
class GANMonitor(keras.callbacks.Callback):
    """A callback to generate and save images after each epoch"""
    def __init__(self, monet_path='monet', photo_path='photo'):
        super().__init__()
        self.test_monet, self.test_photo = next(iter(gan_ds))
        self.monet_path = monet_path
        self.photo_path = photo_path
        # Create directories to save the generated images
        if not os.path.exists(self.monet_path):
            os.makedirs(self.monet_path)
        if not os.path.exists(self.photo_path):
            os.makedirs(self.photo_path)

    def on_epoch_end(self, epoch, logs=None):
        # Monet generated images (the pipeline is RGB, while cv2.imwrite expects BGR)
        prediction = self.model.monet_generator(self.test_photo, training=False)[0].numpy()
        prediction = (prediction * 127.5 + 127.5).astype(np.uint8)
        cv2.imwrite(f'{self.monet_path}/generated_{epoch+1}.jpg', cv2.cvtColor(prediction, cv2.COLOR_RGB2BGR))
        # Photo generated images
        prediction = self.model.photo_generator(self.test_monet, training=False)[0].numpy()
        prediction = (prediction * 127.5 + 127.5).astype(np.uint8)
        cv2.imwrite(f'{self.photo_path}/generated_{epoch+1}.jpg', cv2.cvtColor(prediction, cv2.COLOR_RGB2BGR))
        # Plot the input, generated and cycled images in both directions
        plot_cycle(self.test_photo, self.model.monet_generator, self.model.photo_generator)
        plot_cycle(self.test_monet, self.model.photo_generator, self.model.monet_generator)

## Training the model

In [None]:
TRANSFORMER_BLOCKS = 6
# Networks
monet_generator = get_generator(height=None, width=None, transformer_blocks=TRANSFORMER_BLOCKS) # transforms photos to Monet-esque paintings
photo_generator = get_generator(height=None, width=None, transformer_blocks=TRANSFORMER_BLOCKS) # transforms Monet paintings to be more like photos
monet_discriminator = get_discriminator(height=None, width=None) # differentiates real Monet paintings and generated Monet paintings
photo_discriminator = get_discriminator(height=None, width=None) # differentiates real photos and generated photos
# Optimizers
lr = 2e-7
monet_generator_optimizer = keras.optimizers.Adam(learning_rate=lr, beta_1=0.5)
photo_generator_optimizer = keras.optimizers.Adam(learning_rate=lr, beta_1=0.5)
monet_discriminator_optimizer = keras.optimizers.Adam(learning_rate=lr, beta_1=0.5)
photo_discriminator_optimizer = keras.optimizers.Adam(learning_rate=lr, beta_1=0.5)
# Create GAN
gan_model = CycleGan(monet_generator, photo_generator, 
                     monet_discriminator, photo_discriminator)
gan_model.compile(monet_generator_optimizer, photo_generator_optimizer,
                  monet_discriminator_optimizer, photo_discriminator_optimizer,
                  generator_loss, discriminator_loss, calc_cycle_loss, identity_loss)
gan_monitor_cb = GANMonitor()
EPOCHS = 2
history = gan_model.fit(gan_ds, epochs=EPOCHS, callbacks=[gan_monitor_cb], steps_per_epoch=(300//2//BATCH_SIZE))

In [None]:
monet_generator.save('monet_generator.h5')
photo_generator.save('photo_generator.h5')
monet_discriminator.save('monet_discriminator.h5')
photo_discriminator.save('photo_discriminator.h5')

### Load pretrained model

In [None]:
monet_generator = keras.models.load_model("/kaggle/input/different-approaches-to-style-transfer-v3-cyclegan/monet_generator.h5") # transforms photos to Monet-esque paintings
photo_generator = keras.models.load_model("/kaggle/input/different-approaches-to-style-transfer-v3-cyclegan/photo_generator.h5") # transforms Monet paintings to be more like photos
monet_discriminator = keras.models.load_model("/kaggle/input/different-approaches-to-style-transfer-v3-cyclegan/monet_discriminator.h5") # differentiates real Monet paintings and generated Monet paintings
photo_discriminator = keras.models.load_model("/kaggle/input/different-approaches-to-style-transfer-v3-cyclegan/photo_discriminator.h5") # differentiates real photos and generated photos
# Create GAN
gan_model = CycleGan(monet_generator, photo_generator, 
                     monet_discriminator, photo_discriminator)
gan_model.compile(monet_generator_optimizer, photo_generator_optimizer,
                  monet_discriminator_optimizer, photo_discriminator_optimizer,
                  generator_loss, discriminator_loss, calc_cycle_loss, identity_loss)
#EPOCHS = 1
#history = gan_model.fit(gan_ds, epochs=EPOCHS, callbacks=[gan_monitor_cb], steps_per_epoch=(300//BATCH_SIZE))

## Inference

In [None]:
for _ in range(3):
    plot_cycle(next(iter(photo_ds.batch(1))), monet_generator, photo_generator, n_samples=1)

In [None]:
def display_generated_samples(ds, model, n_samples):
    ds_iter = iter(ds)
    for n_sample in range(n_samples):
        example_sample = next(ds_iter)
        generated_sample = model.predict(example_sample)
        f = plt.figure(figsize=(12, 12))
        plt.subplot(121)
        plt.title('Input image')
        plt.imshow(example_sample[0] * 0.5 + 0.5)
        plt.axis('off')
        plt.subplot(122)
        plt.title('Generated image')
        plt.imshow(generated_sample[0] * 0.5 + 0.5)
        plt.axis('off')
        plt.show()
        
display_generated_samples(photo_ds.batch(1).take(8), monet_generator, 8)

## Create submission file

In [None]:
! mkdir ../gan_images

In [None]:
import shutil

def save_cycleGAN_submission():
    i = 1
    for img in photo_ds.batch(1):
        prediction = monet_generator.predict(img)[0]
        prediction = (prediction * 127.5 + 127.5).astype(np.uint8)   # re-scale
        # The generator output is RGB, while cv2.imwrite expects BGR
        cv2.imwrite("../gan_images/" + str(i) + ".jpg", cv2.cvtColor(prediction, cv2.COLOR_RGB2BGR))
        i += 1
        if i%500 == 0:
            print(i)
    shutil.make_archive("/kaggle/working/images", 'zip', "/kaggle/gan_images")
    
save_cycleGAN_submission()

---------------------------------------
# Final conclusions and future work

Judging from the other notebooks in the competition, it looks like a properly trained CycleGAN performs much better than traditional NST methods.
However, it is quite difficult to train GANs properly.
So, it is worth exploring ways to improve the initial model presented here.

On the other hand, the objective function of this competition does not take into account any content images, i.e. it only measures how similar the style of the generated image is to the Monet style.
That could be the reason why images from AdaIN look much better to the eye but get a much worse score.

Plans for future work:
* Learn about further developments of CycleGANs and other more recent methods
* Learn how to use TPUs