# Neural Style Transfer

Neural Style Transfer is a technique that uses a neural network to generate an image that is a combination of the content of one image and the style of another image. The technique was introduced by Gatys et al. in the paper [A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576).


- The paper uses the VGG19 model

![VGG19](assets/vgg19.png)

- We take that pretrained model and freeze all of its weights. We pass the content image, the style image and a random image through the model. We then calculate the content loss (the difference between the features of the content image and the random image) and the style loss (the difference between the feature statistics of the style image and the random image), and we use these losses to update the pixels of the random image so that both losses decrease.
    - the content loss is designed so that the random image ends up with the same content as the content image
    - the style loss is designed so that the random image ends up with the same style as the style image
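The idea of keeping the network frozen and optimizing the pixels can be sketched on a toy problem. This is a minimal sketch, not the notebook's implementation: the small `extractor` network is a hypothetical stand-in for VGG19, and the image size, learning rate and step count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the frozen VGG19 feature extractor (assumption)
extractor = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
).eval().requires_grad_(False)  # freeze every weight

target_img = torch.rand(1, 3, 16, 16)                      # plays the role of the content image
generated = torch.rand(1, 3, 16, 16, requires_grad=True)   # the "random image" we optimize

target_features = extractor(target_img)  # fixed target features
optimizer = torch.optim.Adam([generated], lr=0.01)  # the optimizer only sees the image

losses = []
for step in range(300):
    loss = torch.mean((extractor(generated) - target_features) ** 2)
    optimizer.zero_grad()
    loss.backward()   # gradients reach the pixels, not the frozen weights
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.5f} -> {losses[-1]:.5f}")
```

Only `generated` is handed to the optimizer, so the network acts purely as a fixed measuring device for the image.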


- Content Loss:
    - The content loss is the squared error between the feature maps of the content image and the random image at a certain layer of the model. The paper uses half the sum of the squared differences; the code further down uses the mean instead, which only changes the scale of the loss.
    - Andrew Ng explains that this layer is chosen from the intermediate layers of the model, because the early layers capture low-level features like edges and textures and the later layers capture high-level features like objects and scenes, while the intermediate layers capture something in between.
    - the content loss is calculated as follows:
        - $L_{content} = \frac{1}{2} \sum (F_{content} - F_{random})^2$ where $F_{content}$ and $F_{random}$ are the feature maps (activations) of the content image and the random image at a certain layer of the model, and the sum runs over all elements of the feature maps.
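The two variants of the content loss can be compared on a tiny example. This is a minimal sketch: the function names are illustrative and the tensors are made-up stand-ins for real feature maps.

```python
import torch

def content_loss_paper(f_content, f_random):
    # half the sum of squared differences, as in the formula above
    return 0.5 * torch.sum((f_content - f_random) ** 2)

def content_loss_mean(f_content, f_random):
    # mean squared error, the variant used in the training loop of this notebook
    return torch.mean((f_content - f_random) ** 2)

# made-up feature maps of shape (batch, C, H, W) = (1, 2, 2, 2): 8 elements, each differing by 1
f_content = torch.zeros(1, 2, 2, 2)
f_random = torch.ones(1, 2, 2, 2)

paper_value = content_loss_paper(f_content, f_random).item()  # 0.5 * 8 = 4.0
mean_value = content_loss_mean(f_content, f_random).item()    # 8 / 8 = 1.0
print(paper_value, mean_value)
```

Both versions have the same minimizer; they differ only by a constant factor that can be absorbed into $\alpha$.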

- Style Loss:
    - The style loss is the mean squared error (the squared L2 distance between the vectorized matrices, up to a constant factor) between the Gram matrices of the feature maps of the style image and the random image
    - the Gram matrix captures the (uncentered, unnormalized) correlations between the channels of the feature maps
        - if we have a feature map of shape (H, W, C) then the Gram matrix will be of shape (C, C) where C is the number of channels; each element in the Gram matrix (say $G_{ij}$) is the dot product between the vectorized version of the $i^{th}$ channel and the $j^{th}$ channel of the feature map
    - we calculate the style loss for each chosen layer of the model and sum them up to get the total style loss. The paper uses 5 layers: 'block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1' (Keras layer names; the paper calls them conv1_1 to conv5_1), i.e. the first convolutional layer of each block in the model.
    - the style loss is calculated as follows:
        - $L_{style} = \sum_{l} \frac{1}{4 \times C^2 \times H^2 \times W^2} \sum_{i,j} (G_{style} - G_{random})^2$ where $G_{style}$ and $G_{random}$ are the Gram matrices of the style image and the random image at layer $l$, the inner sum runs over the entries $(i, j)$ of the Gram matrices, and $C$, $H$, $W$ are the dimensions of the feature maps at that layer.
        - the $\frac{1}{4 \times C^2 \times H^2 \times W^2}$ is a normalization factor to make the style loss independent of the size of the feature maps.
        - The outer sum runs over the 5 chosen layers, so the total style loss is the sum of their style losses.
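The Gram matrix and the per-layer style loss can be sketched as follows. This is a minimal sketch that assumes a batch size of 1 and PyTorch's (batch, C, H, W) layout; the function names and the example tensor are illustrative only.

```python
import torch

def gram_matrix(features):
    """Map feature maps of shape (1, C, H, W) to a (C, C) Gram matrix."""
    _, c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.t()              # dot products between every pair of channels

def style_loss_layer(f_style, f_random):
    """Style loss of one layer with the normalization factor from the formula above."""
    _, c, h, w = f_style.shape
    diff = gram_matrix(f_style) - gram_matrix(f_random)
    return torch.sum(diff ** 2) / (4 * c**2 * (h * w) ** 2)

# made-up feature maps with C = 2, H = W = 2
features = torch.tensor([[[[1., 2.], [3., 4.]],
                          [[1., 0.], [0., 1.]]]])
gram = gram_matrix(features)
# channel 0 flattened: [1, 2, 3, 4], channel 1 flattened: [1, 0, 0, 1]
# G_00 = 1 + 4 + 9 + 16 = 30, G_01 = G_10 = 1 + 4 = 5, G_11 = 1 + 1 = 2
print(gram)
print(style_loss_layer(features, features).item())  # identical inputs give zero loss
```

Because the spatial positions are summed out, the Gram matrix keeps which channels fire together but discards where they fire, which is why it describes style rather than layout.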

- Total Loss:
    - The total loss is the sum of the content loss and the style loss multiplied by their respective weights ($\alpha$ and $\beta$)
    - $L_{total} = \alpha \times L_{content} + \beta \times L_{style}$
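As a worked example of the weighting, the snippet below plugs in the hyperparameters used later in this notebook ($\alpha = 1$, $\beta = 10$) together with the final losses logged for the first style image (rounded values read from the progress bar, so the result is approximate).

```python
alpha, beta = 1, 10                      # weights used in the training loop of this notebook
content_loss, style_loss = 75.9, 1.85e7  # rounded values from the first logged run

total_loss = alpha * content_loss + beta * style_loss
content_share = alpha * content_loss / total_loss

print(f"{total_loss:.3g}")      # 1.85e+08
print(f"{content_share:.1e}")   # fraction of the total that comes from the content term
```

With unnormalized Gram matrices the style term dominates the total by many orders of magnitude, so the ratio of $\alpha$ to $\beta$ has to be read together with the scale of each loss.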

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms, models
from PIL import Image
from torchvision.utils import save_image
from tqdm import tqdm

In [2]:
model = models.vgg19(weights=None)
print(model)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padd

- the first convolutional layer of each block in the VGG19 model is either the very first layer of the network (block 1) or the layer right after a max pooling layer (blocks 2 to 5)
    - in the `features` module of the model description, their indices are: 0, 5, 10, 19, 28

In [3]:
class VGG(nn.Module):
    def __init__(self):
        super().__init__()
        # indices (as strings) of the first conv layer of each block in vgg19.features
        self.chosen_layers = ['0', '5', '10', '19', '28']
        # keep layers 0..28 only: nothing after the last chosen layer is needed
        self.model = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:29]

    def forward(self, x):
        activations = []
        # loop on the model features (layers)
        for layer_num, layer in enumerate(self.model):
            # run the input through the layers 
            x = layer(x)
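            # if the layer is one of the chosen layers, add its output to the activations list
            # note: the following ReLU(inplace=True) overwrites this tensor in place, so the stored
            # activation ends up being the post-ReLU output (except for the last layer, 28)
            if str(layer_num) in self.chosen_layers:
                activations.append(x)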
        
        return activations # will contain the outputs of the chosen layers
    

def load_image(image_name):
    image = Image.open(image_name).convert("RGB") # force 3 channels so the 3-channel normalization below always applies
    image = transform(image).unsqueeze(0) # add a dimension at the beginning of the tensor (batch size)
    return image.to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# specify the image size (the better the GPU, the bigger the image size can be)
imsize = 512 # this is important as we need Content, Style and Generated images to have the same size (in order to perform the loss calculations)

transform = transforms.Compose([
    transforms.Resize((imsize, imsize)),
    transforms.ToTensor(),
    # ImageNet normalization expected by VGG19; reportedly it also works without normalization.
    # If we normalize, we need to denormalize the generated image before saving it (multiply by std and add mean)
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
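The normalization and its inverse can be checked with a round trip. This is a minimal sketch: the helper names `normalize` and `denormalize` are illustrative and the input is a random tensor, not a real image.

```python
import torch

# ImageNet statistics, reshaped so they broadcast over (batch, 3, H, W)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def normalize(img):
    return (img - mean) / std

def denormalize(img):
    # inverse of normalize: multiply by std, then add mean
    return img * std + mean

img = torch.rand(1, 3, 4, 4)             # stand-in for an image with values in [0, 1]
restored = denormalize(normalize(img))
round_trip_ok = torch.allclose(restored, img, atol=1e-6)
print(round_trip_ok)
```

During optimization the pixel values can drift outside the range of a valid image, which is why the result is clamped to [0, 1] after denormalizing and before saving.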

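- some people start from the original content image instead of a random image. This is only a different initialization (it should not be confused with "fast neural style transfer", a name that usually refers to training a separate feed-forward network that applies a style in a single pass). It converges faster because the image already has the right content, so the optimization mostly has to reduce the style loss; the content loss is still part of the objective and keeps the result close to the content image. This is useful when we want to apply the style of the style image to the content image.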
     - this is what we will do

In [4]:
original_img = load_image("/kaggle/input/nst-images/content2.jfif")
style_images = ['/kaggle/input/nst-images/style1.jfif','/kaggle/input/nst-images/style2.jpg','/kaggle/input/nst-images/style3.jfif','/kaggle/input/nst-images/style4.jfif','/kaggle/input/nst-images/style5.jfif','/kaggle/input/nst-images/style6.jfif']

for i, path in enumerate(style_images):
    #generated  = torch.randn(original_img.shape, device=device, requires_grad=True) # random image
    generated = original_img.clone().requires_grad_(True) # clone the original image and set requires_grad to True to optimize it

    # Hyper parameters
    total_steps = 6000
    learning_rate = 0.001
    alpha = 1 # content loss hyperparameter
    beta = 10 # style loss hyperparameter

    # we set the optimizer to optimize the generated image
    optimizer = optim.Adam([generated], lr=learning_rate)


    # freeze the model: we don't train it, we only use it to get the activations of the chosen layers
    model = VGG().to(device).eval().requires_grad_(False)
    style_img = load_image(path)

    
    tk0 = tqdm(range(total_steps), total=total_steps)
    for step in tk0:
        # pass the 3 images through the model
        generated_features = model(generated)
        original_img_features = model(original_img)
        style_features = model(style_img)
        # the outputs above are a list of 5 activations (one for each chosen layer)

        # initialize the losses to 0 at the beginning of each step
        style_loss = 0
        content_loss = 0

        for gen_feature, orig_feature, style_feature in zip(generated_features, original_img_features, style_features):
            # get the dimensions of the current activations (feature maps)
            batch_size, channel, height, width = gen_feature.shape 

            ## compute the content loss
            # mean squared error; the paper uses 0.5 * sum instead, which only changes the scale
            content_loss += torch.mean((gen_feature - orig_feature) ** 2)
            ## compute the style loss
            # Gram matrix of the generated image: (C, H*W) x (H*W, C) -> (C, C), assumes batch_size == 1
            gen_flat = gen_feature.view(channel, height * width)
            G_generated = gen_flat.mm(gen_flat.t())
            # Gram matrix of the style image
            style_flat = style_feature.view(channel, height * width)
            G_style = style_flat.mm(style_flat.t())
            # mean over the C x C entries; the paper divides the sum by 4 * C^2 * (H*W)^2 instead
            style_loss += torch.mean((G_generated - G_style) ** 2)

        # compute the total loss
        total_loss = alpha * content_loss + beta * style_loss

        # back propagation 
        optimizer.zero_grad()
        total_loss.backward()

        # update the generated image
        optimizer.step()

        tk0.set_postfix(total_loss=total_loss.item(), content_loss=content_loss.item(), style_loss=style_loss.item())
        if step % 200 == 0:
            #print(total_loss.item())
            # detach the generated image from the graph and denormalize it (multiply by std, add mean)
            std = torch.tensor([0.229, 0.224, 0.225], device=device).view(3, 1, 1)
            mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(3, 1, 1)
            denormalized = generated.detach() * std + mean
            save_image(denormalized.clamp(0, 1), f"output{i}.png")

Downloading: "https://download.pytorch.org/models/vgg19-dcbb9e9d.pth" to /root/.cache/torch/hub/checkpoints/vgg19-dcbb9e9d.pth
100%|██████████| 548M/548M [00:03<00:00, 165MB/s]
100%|██████████| 6000/6000 [16:14<00:00,  6.16it/s, content_loss=75.9, style_loss=1.85e+7, total_loss=1.85e+8]
100%|██████████| 6000/6000 [16:12<00:00,  6.17it/s, content_loss=71.3, style_loss=6.71e+7, total_loss=6.71e+8]
100%|██████████| 6000/6000 [16:10<00:00,  6.18it/s, content_loss=69.1, style_loss=1.34e+7, total_loss=1.34e+8]
100%|██████████| 6000/6000 [16:12<00:00,  6.17it/s, content_loss=70, style_loss=9.5e+7, total_loss=9.5e+8]
100%|██████████| 6000/6000 [16:12<00:00,  6.17it/s, content_loss=84.8, style_loss=7.46e+7, total_loss=7.46e+8]
100%|██████████| 6000/6000 [16:11<00:00,  6.17it/s, content_loss=61.9, style_loss=3.22e+7, total_loss=3.22e+8]


- Tips 
    - we calculated the Gram matrix as follows
        - first we reshaped the activations to shape (C, H*W)
        - then we multiplied the activations by their transpose (the transpose has shape (H*W, C), so the product has shape (C, C)); we do that for both the style image and the random image
    - my changes from the paper
        - we started with the content image as the random image
        - we used the first convolutional layer of each block to calculate the content loss as well (the paper uses a single layer)
        - we did not apply per-layer weights or the paper's normalization factor to the style loss: each layer contributes the plain mean of the squared Gram differences (the paper scales each layer's style loss by $\frac{1}{4 \times C^2 \times H^2 \times W^2}$ and a layer weight)
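For comparison, a version that stays closer to the paper on the last two points could look like the sketch below. It is not the code used in this notebook: the function name `paper_style_losses`, the default `content_index=3` (a stand-in that picks one of the five chosen layers, not the exact layer the paper uses) and the uniform default weights are all assumptions, and the feature lists are random tensors standing in for the VGG activations.

```python
import torch

def gram_matrix(features):
    _, c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.t()

def paper_style_losses(gen_feats, content_feats, style_feats,
                       content_index=3, style_weights=None):
    """Content loss on a single layer, style loss as a weighted sum of normalized layer losses."""
    if style_weights is None:
        # hypothetical default: every layer gets the same weight
        style_weights = [1.0 / len(gen_feats)] * len(gen_feats)

    # content loss from one layer only
    content_loss = 0.5 * torch.sum(
        (gen_feats[content_index] - content_feats[content_index]) ** 2)

    style_loss = 0.0
    for w_l, gen, style in zip(style_weights, gen_feats, style_feats):
        _, c, h, w = gen.shape
        diff = gram_matrix(gen) - gram_matrix(style)
        # normalization factor 1 / (4 C^2 (HW)^2), then the layer weight
        style_loss = style_loss + w_l * torch.sum(diff ** 2) / (4 * c**2 * (h * w) ** 2)
    return content_loss, style_loss

# random stand-ins for the five activation maps (assumed shapes)
torch.manual_seed(0)
shapes = [(1, 4, 16, 16), (1, 8, 8, 8), (1, 16, 4, 4), (1, 32, 2, 2), (1, 32, 1, 1)]
gen = [torch.rand(s) for s in shapes]
content = [torch.rand(s) for s in shapes]
style = [torch.rand(s) for s in shapes]

c_loss, s_loss = paper_style_losses(gen, content, style)
print(c_loss.item(), s_loss.item())
```

With the normalization in place the style loss stays on a scale comparable to the content loss, so $\beta$ would need to be retuned rather than reused from the run above.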