# Style transfer

We have seen that CNNs are some of the most powerful networks for image classification and analysis. CNNs process visual information in a feedforward manner, passing an input image through a collection of image filters that extract certain features from it. It turns out that these feature-level representations are useful not only for classification but for image construction as well. These representations are the basis for applications like Style Transfer and Deep Dream, which compose images based on CNN layer activations and extracted features.

In this module we will focus on learning about and implementing the style transfer algorithm. Style transfer allows us to apply the style of one image to another image of our choice. The key to this technique is using a trained CNN to separate the content of an image from its style. If we can do this, then we can merge the content of one image with the style of another and create something entirely different.

Let's talk about how style and content can be separated. By the end of this notebook, we will have all the knowledge we need to generate a stylized image of our own design.

# Separating Style and Content

When a CNN is trained to classify images, its convolutional layers learn to extract more and more complex features from a given image. Intermittently, max pooling layers discard detailed spatial information, which is increasingly irrelevant to the task of classification. The effect is that as we go deeper into a CNN, the input image is transformed into feature maps that increasingly care about the content of the image rather than any detail about the texture and color of pixels. Later layers of a network are sometimes even referred to as a content representation of an image.

<img src="assets/CNNTrainingFlow.png">
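To make the idea of discarded spatial detail concrete, here is a minimal NumPy sketch of 2x2 max pooling. The helper `max_pool_2x2` and the two small arrays are illustrative assumptions, not code from this notebook; they only show that two feature maps with the same strong activations in different positions can pool to the same output.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width) array with even sides."""
    h, w = feature_map.shape
    # Split into 2x2 windows, then keep only the largest value in each window
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two hypothetical feature maps: same activations, different positions
a = np.array([[9, 0, 0, 5],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [3, 0, 0, 7]])
b = np.array([[0, 0, 0, 0],
              [0, 9, 5, 0],
              [0, 3, 7, 0],
              [0, 0, 0, 0]])

print(max_pool_2x2(a))
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

After pooling, the network can no longer tell exactly where inside each window an activation occurred, only that it was present.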

In this way, a trained CNN has already learned to represent the content of an image, but what about style? Style can be thought of as traits that might be found in the brush strokes of a painting: its texture, colors, curvature, and so on. To perform style transfer, we need to combine the content of one image with the style of another. So, how can we isolate only the style of an image?

To represent the style of an input image, a feature space designed to capture texture and color information is used. This space essentially looks at spatial correlations within a layer of a network. A correlation is a measure of the relationship between two or more variables. For example, we could look at the features extracted in the first convolutional layer, which has some depth. The depth corresponds to the number of feature maps in that layer. For each feature map, we can measure how strongly its detected features relate to the other feature maps in that layer. Is a certain color detected in one map similar to a color in another map? What about the differences between detected edges and corners, and so on?

The goal is to see which colors and shapes in a set of feature maps are related and which are not. Say we detect that many feature maps in the first convolutional layer have similar pink edge features. If there are common colors and shapes among the feature maps, then this can be thought of as part of that image's style. So, the similarities and differences between features in a layer should give us some information about the texture and color found in an image. At the same time, they should leave out information about the actual arrangement and identity of different objects in that image.

Now, we have seen that content and style can be separate components of an image. Let's think about this in a complete style transfer example. Style transfer will look at two different images. We often call these the style image and the content image. Using a trained CNN, style transfer finds the style of one image and the content of the other. Finally, it tries to merge the two to create a new third image. In this newly created image, the objects and their arrangement are taken from the content image, and the colors and textures are taken from the style image.

This is the theory behind how style transfer works. Next, let's talk more about how we can actually extract features from different layers of a trained model and use them to combine the style and content of two different images. 

# VGG19 and Content Loss

In the code example that we will go through, we will recreate the style transfer method outlined in the paper *Image Style Transfer Using Convolutional Neural Networks*. In this paper, style transfer uses the features found in the 19-layer VGG network, which we will call VGG19. This network accepts a color image as input and passes it through a series of convolutional and pooling layers, followed finally by three fully connected layers that classify the passed-in image. Each of the five pooling layers is preceded by a stack of two or four convolutional layers. The depth of these layers is constant within each stack and grows from 64 in the first stack to 512 in the last two. The layers are named by stack and by their order in the stack: conv1_1 is the first convolutional layer that an image is passed through in the first stack, conv2_1 is the first convolutional layer in the second stack, and the deepest convolutional layer in the network is conv5_4.

<img src="assets/VGG19_layers.png">
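The naming scheme can be sketched in a few lines of plain Python. The constant `VGG19_STACKS` and the helper `conv_layer_names` are hypothetical names introduced here for illustration; the per-stack layer counts and depths are the standard VGG19 configuration.

```python
# (number of convolutional layers, depth) for each of the five stacks.
# Every stack is followed by one pooling layer.
VGG19_STACKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]

def conv_layer_names(stacks=VGG19_STACKS):
    """Return (name, depth) pairs such as ('conv1_1', 64) in network order."""
    layers = []
    for stack_idx, (n_convs, depth) in enumerate(stacks, start=1):
        for conv_idx in range(1, n_convs + 1):
            layers.append((f"conv{stack_idx}_{conv_idx}", depth))
    return layers

layers = conv_layer_names()
names = [name for name, _ in layers]
print(names[0], names[-1], len(names))  # conv1_1 conv5_4 16
```

The 16 convolutional layers plus the three fully connected layers give the 19 weight layers that VGG19 is named for.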

Now, we know that style transfer wants to create an image that has the content of one image and the style of another. To create this new image, which we will call our target image, it will first pass both the content and style images through the VGG19 network.

First, when the network sees the content image, it will go through the feed-forward process until it gets to a convolutional layer that is deep in the network. The output of this layer will be the content representation of the input image. 

<img src="assets/ContentImageVGG19.png">

Next, when it sees the style image, it will extract different features from multiple layers that represent the style of that image.

<img src="assets/StyleImageVGG19.png">

Finally, it will use both the content and style representations to inform the creation of the target image. The challenge is how to create the target image. How can we take a target image which often starts as either a blank canvas or as a copy of our content image, and manipulate it so that its content is close to that of our content image, and its style is close to that of our style image? 

Let's start by discussing the content. In the paper, the content representation for an image is taken as the output of conv4_2, the second convolutional layer in the fourth stack. As we form our new target image, we will compare its content representation with that of our content image. These two representations should stay close to each other even as our target image changes its style.

To formalize this comparison, we will define a content loss, a loss that calculates the difference between the content and target image representations, which we will call $C_c$ and $T_c$ respectively. In this case, we calculate the mean squared difference between the two representations. This is our content loss, and it measures how far away these two representations are from one another.

<img src="assets/ContentLoss.png">
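As a minimal sketch of this comparison, the mean squared difference can be written with NumPy as follows. The function name `content_loss` and the randomly generated arrays are assumptions for illustration; in a real implementation the two arrays would be the conv4_2 outputs for the target and content images.

```python
import numpy as np

def content_loss(target_features, content_features):
    """Mean squared difference between two feature representations of equal shape."""
    return float(np.mean((target_features - content_features) ** 2))

rng = np.random.default_rng(0)
# Hypothetical stand-ins for layer outputs, shaped (depth, height, width)
content_features = rng.standard_normal((512, 8, 8))
target_features = content_features + 0.1   # every value is off by 0.1

print(content_loss(content_features, content_features))  # 0.0
print(content_loss(target_features, content_features))   # approximately 0.01
```

The loss is zero only when the two representations match exactly, and it grows as they drift apart.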

As we try to create the best target image, our aim will be to minimize this loss. This is similar to how we used loss and optimization to determine the weights of a CNN during training. But this time, our aim is not to minimize classification error. In fact, we are not training the CNN at all. Rather our goal is to change only the target image, updating its appearance until its content representation matches that of our content image. 

So, we are not using the VGG19 network in a traditional sense: we are not training it to produce a specific output. Instead, we are using it as a feature extractor and using backpropagation to minimize a defined loss function between our target and content images. We will also have to define a loss function between our target and style images in order to produce an image with our desired style. Next, let's learn more about how to represent the style of an image.
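The idea of updating the image rather than the network can be sketched with a toy example. Everything here is an assumption made for illustration: a fixed random linear map `W` stands in for the frozen VGG19 feature extractor, the images are short vectors, and the gradient is written out by hand instead of being computed by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen, pretrained feature extractor: a fixed linear map.
W = rng.standard_normal((32, 16))
W_before = W.copy()

def features(image):
    return W @ image

content_image = rng.standard_normal(16)
target_image = np.zeros(16)   # start from a "blank canvas"

def content_loss(target):
    diff = features(target) - features(content_image)
    return float(np.mean(diff ** 2))

initial_loss = content_loss(target_image)
learning_rate = 0.05
for _ in range(500):
    diff = features(target_image) - features(content_image)
    # Gradient of the mean squared difference with respect to the target image
    grad = 2.0 / diff.size * (W.T @ diff)
    target_image -= learning_rate * grad   # only the image is updated, never W

final_loss = content_loss(target_image)
print(initial_loss, final_loss)
```

The extractor's weights are never touched; the loss falls because the target image itself moves toward an image with the same features as the content image.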

### A Useful Resource: [Image Style Transfer Using Convolutional Neural Networks](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf)

# Gram Matrix

To make sure that our target image has the same content as our content image, we formalized the idea of a content loss that compares the content representations of the two images. Next, we want to do the same thing for the style representations of our target image and style image. The style representation of an image relies on looking at correlations between the features in individual layers of the VGG19 network, in other words, at how similar the features in a single layer are. Similarities will include the general colors and textures found in that layer.

We typically find the similarities between features in multiple layers in the network. By including the correlations between multiple layers of different sizes, we can obtain a multiscale style representation of the input image, one that captures large and small style features. 

<img src="assets/MultiscaleCNN.png">

The style representation is calculated as an image passes through the network, using the first convolutional layer in each of the five stacks: conv1_1, conv2_1, up to conv5_1.

The correlations at each layer are given by a Gram matrix. The matrix is the result of a couple of operations, and it is easiest to see in a simple example. Say we start off with a 4x4 image, and we convolve it with eight different image filters to create a convolutional layer. This layer will be 4x4 in height and width, and 8 in depth. Thinking about the style representation for this layer, we can say that this layer has eight feature maps that we want to find the relationships between.

The first step in calculating the Gram matrix is to vectorize the values in this layer. This is very similar to what we have seen before, in the case of vectorizing an image so that it can be seen by an MLP. The first row of four values in the feature map will become the first four values in a vector of length 16. The last row will become the last four values in that vector.

<img src="assets/GramVectorizing.png">
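The vectorizing step can be sketched with a NumPy reshape. The array `layer` is a hypothetical layer output stored as (depth, height, width) and filled with consecutive numbers so that positions are easy to follow; the axis order is an assumption of this sketch.

```python
import numpy as np

depth, height, width = 8, 4, 4
# Hypothetical layer output, filled with 0..127 so positions are easy to track
layer = np.arange(depth * height * width, dtype=float).reshape(depth, height, width)

# Flatten the spatial dimensions: one row of 16 values per feature map
flattened = layer.reshape(depth, height * width)

print(flattened.shape)                                   # (8, 16)
print(np.array_equal(flattened[0, :4], layer[0, 0]))     # True
print(np.array_equal(flattened[0, -4:], layer[0, -1]))   # True
```

The two checks confirm that the first and last rows of a feature map end up as the first and last four values of its vector.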

By flattening the X and Y dimensions of the feature maps, we convert a 3D convolutional layer into a 2D matrix of values. The next step is to multiply this matrix by its transpose, which essentially multiplies the features in each map with those in every other map to produce the Gram matrix. This operation treats each value in a feature map as an individual sample, unrelated in space to other values. So, the resultant Gram matrix contains non-localized information about the layer. Non-localized information is information that would still be there even if an image were shuffled around in space. For example, even if the content of a filtered image is not identifiable, we should still be able to see the prominent colors and textures: the style.

<img src="assets/SecondStepGram.png">
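Here is a minimal NumPy sketch of this step, together with a check of the non-localized property. The function name `gram_matrix`, the (depth, height, width) layout, and the random input are assumptions for illustration.

```python
import numpy as np

def gram_matrix(layer):
    """Gram matrix of a (depth, height, width) layer: F @ F.T, F shaped (depth, h*w)."""
    depth = layer.shape[0]
    flattened = layer.reshape(depth, -1)
    return flattened @ flattened.T

rng = np.random.default_rng(0)
layer = rng.standard_normal((8, 4, 4))
gram = gram_matrix(layer)

# Shuffle the 16 spatial positions the same way in every feature map
perm = rng.permutation(16)
shuffled = layer.reshape(8, 16)[:, perm].reshape(8, 4, 4)

print(gram.shape)                                  # (8, 8)
print(np.allclose(gram, gram_matrix(shuffled)))    # True
```

Shuffling the positions changes where each value sits but not which values are multiplied together, so the Gram matrix is unchanged.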

Finally, we are left with the square 8x8 Gram matrix, whose values indicate the similarities between the feature maps in the layer.

So, the entry in row 4, column 2 of the Gram matrix holds a value that indicates the similarity between the fourth and second feature maps in the layer.
Importantly, the dimensions of this matrix depend only on the number of feature maps in the convolutional layer, not on the dimensions of the input image.

<img src="assets/ResultGramExample.png">
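Both points can be checked with a short NumPy sketch. The spatial sizes and random values are arbitrary assumptions; only the depth of 8 comes from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 8
shapes = []
for height, width in [(4, 4), (32, 32), (17, 40)]:
    # Hypothetical layer output with 8 feature maps of the given spatial size
    flattened = rng.standard_normal((depth, height, width)).reshape(depth, -1)
    gram = flattened @ flattened.T
    shapes.append(gram.shape)

print(shapes)  # [(8, 8), (8, 8), (8, 8)]

# Row 4, column 2 (indices 3 and 1 when counting from zero) is the dot product
# of the fourth and second flattened feature maps.
print(np.isclose(gram[3, 1], flattened[3] @ flattened[1]))  # True
```

Whatever the spatial size, the result stays 8x8 because each entry sums over all spatial positions of a pair of feature maps.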

We should note that the Gram matrix is just one mathematical way of representing the idea of shared and prominent styles. Style itself is an abstract idea, but the Gram matrix is the most widely used representation in practice.

<img src="assets/QuizGramMatrix.png">
Yes! When the height and width (8 x 8) are flattened, the resultant 2D matrix will have as many columns as the height and width, multiplied: 
$8*8 = 64$

<img src="assets/QuizGramMatrix2.png">
Yes! The Gram matrix will be a square matrix, with a width and height equal to the depth of the convolutional layer in question.

Now that we have defined the Gram matrix as holding information about the style of a given layer, we can next calculate a style loss that compares the style of our target image with that of our style image.