forward:
The output of the fully connected layer is computed as
$$\boldsymbol{y} = \boldsymbol{x}\boldsymbol{W} + \boldsymbol{b}$$
where $\boldsymbol{x}\in{\mathbb{R}^m}$ is the input sample,
$\boldsymbol{W}\in{\mathbb{R}^{m\times{n}}}$ is the weight matrix,
$\boldsymbol{b}\in{\mathbb{R}^n}$ is the bias vector,
and $\boldsymbol{y}\in{\mathbb{R}^n}$ is the output of the fully connected layer.
backward:
Define $\nabla_{\boldsymbol{y}} L$ as the partial derivative of the loss function $L$ with respect to ${\boldsymbol{y}}$ ("top_diff" in layers_1.py).
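A minimal numpy sketch of the fully connected layer's forward and backward passes is given below; the class and attribute names are illustrative assumptions and not necessarily the interface used in layers_1.py.

```python
import numpy as np

class FullyConnectedLayer:
    """Illustrative sketch of a fully connected layer (not the layers_1.py implementation)."""
    def __init__(self, num_input, num_output):
        # W in R^{m x n}, b in R^n
        self.W = 0.01 * np.random.randn(num_input, num_output)
        self.b = np.zeros(num_output)

    def forward(self, x):
        # y = xW + b, where x has shape [batch, m]
        self.x = x
        return np.dot(x, self.W) + self.b

    def backward(self, top_diff):
        # top_diff is the gradient of the loss with respect to y
        self.d_W = np.dot(self.x.T, top_diff)     # gradient with respect to W
        self.d_b = np.sum(top_diff, axis=0)       # gradient with respect to b
        bottom_diff = np.dot(top_diff, self.W.T)  # gradient passed to the previous layer
        return bottom_diff
```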
The softmax layer converts its input into classification probabilities:
$$\boldsymbol{\hat{y}}(i) = \frac{e^{\boldsymbol{x}(i)}}{\sum\limits_{j}{e^{\boldsymbol{x}(j)}}}$$
where $\boldsymbol{x}(i)$ is element $i$ of the softmax layer input,
and $\boldsymbol{\hat{y}}(i)$ is the classification probability at position
$i$ of the softmax layer. When $e^{\boldsymbol{x}(i)}$ is too large, the formula above can be rewritten to avoid numerical overflow:
$$\boldsymbol{\hat{y}}(i) = \frac{e^{\boldsymbol{x}(i)-\max\limits_{n}\boldsymbol{x}(n)}}{\sum\limits_{j}{e^{\boldsymbol{x}(j)-\max\limits_{n}\boldsymbol{x}(n)}}} $$
forward:
The loss function of the softmax layer is defined as:
$$ L = -\sum\limits_{i}\boldsymbol{y}(i)\ln\boldsymbol{\hat{y}}(i)$$
where $\boldsymbol{y}(i)$ is element $i$ of the label vector $\boldsymbol{y}$.
Considering batch processing:
$$ L = -\frac{1}{p}\sum\limits_{i,j}\boldsymbol{Y}(i,j)\ln\boldsymbol{\hat{Y}}(i,j)$$
where $\boldsymbol{Y}(i,j)$ is the element at position $(i,j)$ of the label matrix $\boldsymbol{Y}\in{\mathbb{R}^{p\times{n}}}$, and every row vector of $\boldsymbol{\hat{Y}}\in{\mathbb{R}^{p\times{n}}}$ corresponds to the softmax output of one sample:
$$\boldsymbol{\hat{Y}}(i,j)=\frac{e^{\boldsymbol{X}(i,j)-\max\limits_{n}\boldsymbol{X}(i,n)}}{\sum\limits_{l}{e^{\boldsymbol{X}(i,l)-\max\limits_{n}\boldsymbol{X}(i,n)}}}$$
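A numpy sketch of this batched, numerically stable softmax together with the cross-entropy loss; the function names here are illustrative assumptions.

```python
import numpy as np

def softmax_forward(X):
    # X: [p, n] scores; subtract the row-wise max for numerical stability
    X_shift = X - np.max(X, axis=1, keepdims=True)
    exp_X = np.exp(X_shift)
    Y_hat = exp_X / np.sum(exp_X, axis=1, keepdims=True)  # each row sums to 1
    return Y_hat

def cross_entropy_loss(Y_hat, Y):
    # Y: [p, n] one-hot label matrix; average the loss over the batch of size p
    p = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat + 1e-12)) / p
```

For the backward pass, the gradient of this combined softmax/cross-entropy loss with respect to the softmax input reduces to $(\boldsymbol{\hat{Y}}-\boldsymbol{Y})/p$.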
Convolution Kernel $\boldsymbol{W}\in\mathbb{R}^{C_{in}\times{K}\times{K}\times{C_{out}}}$.
$C_{in}$ is the number of input channels,
$K$ is kernel size,
$C_{out}$ is the number of output channels.
The input feature map $X\in\mathbb{R}^{N\times{C_{in}}\times{H_{in}}\times{W_{in}}}$.
$N$ is the number of input samples.
$H_{in}$ and $W_{in}$ are the height and width of the feature map. Likewise, the output feature map has shape $N\times{C_{out}}\times{H_{out}}\times{W_{out}}$. In addition, $p$ is the padding size and $s$ is the stride.
To obtain the expected output size in each layer, the input image is zero-padded; after padding, the output size is
$$H_{out}=\frac{H_{in}+2p-K}{s}+1,\quad W_{out}=\frac{W_{in}+2p-K}{s}+1$$
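Below is a naive loop-based sketch of the padded convolution forward pass, assuming the $C_{in}\times{K}\times{K}\times{C_{out}}$ kernel layout defined above; the function and variable names are illustrative and not the repository's actual implementation.

```python
import numpy as np

def conv_forward(X, W, b, pad, stride):
    # X: [N, C_in, H_in, W_in], W: [C_in, K, K, C_out], b: [C_out]
    N, C_in, H_in, W_in = X.shape
    _, K, _, C_out = W.shape
    H_out = (H_in + 2 * pad - K) // stride + 1
    W_out = (W_in + 2 * pad - K) // stride + 1
    # zero-pad only the spatial dimensions
    X_pad = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    Y = np.zeros((N, C_out, H_out, W_out))
    for n in range(N):
        for co in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    window = X_pad[n, :, h*stride:h*stride+K, w*stride:w*stride+K]
                    Y[n, co, h, w] = np.sum(window * W[:, :, :, co]) + b[co]
    return Y
```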
The input of the max pooling layer is $\boldsymbol{X}\in\mathbb{R}^{N\times{C}\times{H_{in}}\times{W_{in}}}$
and the output is $\boldsymbol{Y}\in\mathbb{R}^{N\times{C}\times{H_{out}}\times{W_{out}}}$:
$$\boldsymbol{Y}(n,c,h,w) =\max\limits_{k_h,k_w\in{[1,K]}}\boldsymbol{X}(n,c,hs+k_h,ws+k_w)$$
backward:
$\nabla_{\boldsymbol{Y}}L\in{\mathbb{R}^{N\times{C}\times{H_{out}}\times{W_{out}}}}$ represents the partial derivative of $L$ with respect to the output of the max pooling layer. Since max pooling alters the shape of the feature map, we first find the coordinate of the maximum inside each pooling window:
$$p(n,c,h,w)=[k_h,k_w]=\arg\max_{k_h,k_w}\boldsymbol{X}(n,c,hs+k_h,ws+k_w)$$

The gradient is then routed back only to that position:
$$\nabla_{\boldsymbol{X}(n,c,hs+k_h,ws+k_w)}L=\nabla_{\boldsymbol{Y}}L(n,c,h,w) $$
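A numpy sketch of the max pooling forward and backward passes following the formulas above (indices here are 0-based, and the function names are illustrative assumptions):

```python
import numpy as np

def max_pooling_forward(X, K, stride):
    # X: [N, C, H_in, W_in]
    N, C, H_in, W_in = X.shape
    H_out = (H_in - K) // stride + 1
    W_out = (W_in - K) // stride + 1
    Y = np.zeros((N, C, H_out, W_out))
    argmax = np.zeros((N, C, H_out, W_out, 2), dtype=int)  # coordinates of the max in each window
    for n in range(N):
        for c in range(C):
            for h in range(H_out):
                for w in range(W_out):
                    window = X[n, c, h*stride:h*stride+K, w*stride:w*stride+K]
                    kh, kw = np.unravel_index(np.argmax(window), window.shape)
                    argmax[n, c, h, w] = [kh, kw]
                    Y[n, c, h, w] = window[kh, kw]
    return Y, argmax

def max_pooling_backward(top_diff, X_shape, argmax, stride):
    # top_diff: gradient of L with respect to Y, shape [N, C, H_out, W_out]
    bottom_diff = np.zeros(X_shape)
    N, C, H_out, W_out = top_diff.shape
    for n in range(N):
        for c in range(C):
            for h in range(H_out):
                for w in range(W_out):
                    kh, kw = argmax[n, c, h, w]
                    # route the gradient only to the position that produced the max
                    bottom_diff[n, c, h*stride + kh, w*stride + kw] += top_diff[n, c, h, w]
    return bottom_diff
```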
The pretrained standard model can be obtained from VGG, so the official dataset is not needed for training.
The code for loading a test image is as follows:
```python
import numpy as np
import scipy.misc

def load_image(image_dir):
    input_image = scipy.misc.imread(image_dir)
    input_image = scipy.misc.imresize(input_image, [224, 224, 3])  # unify the size of the input
    input_image = np.array(input_image).astype(np.float32)         # convert to float32
    input_image -= image_mean                                      # image_mean is calculated separately
    input_image = np.reshape(input_image, [1] + list(input_image.shape))  # input dim: [N=1, height=224, width=224, channel=3]
    input_image = np.transpose(input_image, [0, 3, 1, 2])                 # input dim: [N=1, channel=3, height=224, width=224]
    return input_image
```
Result
Classification result: id=281. The class category mapping can be found here.
Content Loss
Suppose $\boldsymbol{X}^l\in{\mathbb{R}^{N\times{C}\times{H}\times{W}}}$
is the $l_{th}$ feature map of the style transfer image, and $\boldsymbol{Y}^l\in{\mathbb{R}^{N\times{C}\times{H}\times{W}}}$
is the $l_{th}$ feature map of the targeted content image. The content loss can be expressed in terms of ${\boldsymbol{X}^l}$
and ${\boldsymbol{Y}^l}$:
$$L_{content}=\frac{1}{2NCHW}\sum\limits_{n,c,h,w}(\boldsymbol{X}^l(n,c,h,w)-\boldsymbol{Y}^l(n,c,h,w))^2$$
The content loss is half the mean squared difference between the two feature maps over all positions.
The gradient of content loss to feature map can be calculated by:
$$\nabla_{\boldsymbol{X}^l}L_{content}(n,c,h,w)=\frac{1}{NCHW}(\boldsymbol{X}^l(n,c,h,w)-\boldsymbol{Y}^l(n,c,h,w))$$
In the experiment, the content feature map is taken from the output of the ReLU layer after conv4_2.
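A numpy sketch of the content loss and its gradient, matching the two formulas above (the function names are illustrative assumptions):

```python
import numpy as np

def content_loss_forward(X, Y):
    # X, Y: feature maps of shape [N, C, H, W]
    N, C, H, W = X.shape
    return np.sum((X - Y) ** 2) / (2.0 * N * C * H * W)

def content_loss_backward(X, Y):
    # gradient of the content loss with respect to the feature map X
    N, C, H, W = X.shape
    return (X - Y) / (N * C * H * W)
```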
Style Loss
Suppose $\boldsymbol{X}^l\in{\mathbb{R}^{N\times{C}\times{H}\times{W}}}$
is the $l_{th}$ feature map of the style transfer image, and $\boldsymbol{Y}^l\in{\mathbb{R}^{N\times{C}\times{H}\times{W}}}$
is the $l_{th}$ feature map of the targeted style image. In forward propagation, the style features of the style transfer image $(\boldsymbol{G})$
and of the targeted style image $(\boldsymbol{A})$
are computed with the Gram matrix:
$$\boldsymbol{G}^l(n,i,j)=\sum\limits_{h,w}\boldsymbol{X}^l(n,i,h,w)\boldsymbol{X}^l(n,j,h,w) $$

$$\boldsymbol{A}^l(n,i,j)=\sum\limits_{h,w}\boldsymbol{Y}^l(n,i,h,w)\boldsymbol{Y}^l(n,j,h,w) $$
where $n\in[1,N]$ indicates one sample and $i,j\in[1,C]$ index channels. The style loss of the $l_{th}$ layer is:
$$L_{style}^l=\frac{1}{4NC^2H^2W^2}\sum\limits_{n,i,j}(\boldsymbol{G}^l(n,i,j)-\boldsymbol{A}^l(n,i,j))^2 $$
The overall style loss is the weighted sum of style loss in each layer.
$$L_{style}=\sum\limits_{l}w_lL_{style}^l $$
In backward propagation, the gradient of $L_{style}^l$ with respect to $\boldsymbol{X}^l$ is:
$$\nabla_{\boldsymbol{X}^l}L_{style}^l(n,i,h,w)=\frac{1}{NC^2H^2W^2}\sum\limits_{j}\boldsymbol{X}^l(n,j,h,w)(\boldsymbol{G}^l(n,j,i)-\boldsymbol{A}^l(n,j,i)) $$
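A numpy sketch of the Gram matrix, the per-layer style loss, and its gradient, following the formulas above (the function names are illustrative assumptions):

```python
import numpy as np

def gram_matrix(X):
    # X: [N, C, H, W] -> Gram matrix of shape [N, C, C]
    N, C, H, W = X.shape
    X_flat = X.reshape(N, C, H * W)
    return np.matmul(X_flat, X_flat.transpose(0, 2, 1))

def style_loss_forward(X, Y):
    # X: feature map of the style transfer image, Y: feature map of the style image
    N, C, H, W = X.shape
    G = gram_matrix(X)
    A = gram_matrix(Y)
    loss = np.sum((G - A) ** 2) / (4.0 * N * C**2 * H**2 * W**2)
    return loss, G, A

def style_loss_backward(X, G, A):
    # gradient of the per-layer style loss with respect to X
    N, C, H, W = X.shape
    X_flat = X.reshape(N, C, H * W)          # [N, C, HW]
    diff = G - A                             # [N, C, C], symmetric
    grad_flat = np.matmul(diff, X_flat)      # sum over j of (G - A)(i, j) * X(j, :)
    return grad_flat.reshape(N, C, H, W) / (N * C**2 * H**2 * W**2)
```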
Based on content loss and style loss, the total loss can be represented as:
$$L_{total}=\alpha{L_{content}+\beta{L_{style}}} $$
Adam Optimizer
To train a neural network, mini-batch stochastic gradient descent is commonly used to update the network parameters. In this experiment, the Adam algorithm is used instead because it converges faster.
Parameter updating:
$$m_t=\beta_1m_{t-1}+(1-\beta_1)\nabla_{\boldsymbol{X}}L $$

$$v_t=\beta_2v_{t-1}+(1-\beta_2)(\nabla_{\boldsymbol{X}}L)^2 $$

$$\hat{m_t}=\frac{m_t}{1-\beta_1^t} $$

$$\hat{v_t}=\frac{v_t}{1-\beta_2^t} $$

$$\boldsymbol{X}\leftarrow\boldsymbol{X}-\eta\frac{\hat{m_t}}{\sqrt{\hat{v_t}}+\epsilon} $$
where $m_t$ is the estimate of the first moment of the gradient, $v_t$ is the estimate of the second moment, and $\hat{m_t}$ and $\hat{v_t}$ are the corresponding bias-corrected estimates.
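A numpy sketch of a single Adam update step on the image $\boldsymbol{X}$, following the formulas above; the function name and the default hyperparameter values are illustrative assumptions, not the values used in the experiment.

```python
import numpy as np

def adam_update(X, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # one Adam step on the image X, given the gradient of the total loss
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * (grad ** 2)
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    X = X - lr * m_hat / (np.sqrt(v_hat) + eps)
    return X, m, v
```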
Result:
content figure
style figure
epoch 10
epoch 20
epoch 30
epoch 40
Note: Processing the images takes a long time (about one hour per epoch). Model acceleration will be considered in the future.