![alt text](https://drive.google.com/uc?id=1fvCtydx3wIPTf2Me3ldzG4OumN9DhPFp)

# Deep Image Prior - Reproduction and Replication 

Arda Kaygan,
Preetha Vijayan,
Yuksel Yonsel

Supervised by Xin Liu

## 1. Introduction
The goal of this project is to reproduce and evaluate the inpainting results reported in the paper "Deep Image Prior" by [Ulyanov et al.](https://dmitryulyanov.github.io/deep_image_prior) In this notebook, we replicate the inpainting experiment done by the researchers while exploring new hyperparameter settings and architecture variants to provide further insight into how these modifications affect the output.

The report is structured as follows. Section 2 gives a brief definition of single image restoration, followed by a summary of the single image restoration method proposed by the researchers.

In Section 3, methods that are adopted for reproduction as well as their objectives are explained.
In Section 4, the default architecture of the hour-glass encoder-decoder network used for the inpainting task is introduced.

Section 5 includes the experiments and code variants executed by us and their results. In each of these sections, experimental setup will be introduced and related code will be executed to obtain results. 

In Section 6, results obtained from the experiments are discussed in light of the explanations from the paper and our own background.

In Section 7, the overall success of the reproduction is reported and the findings are addressed.

---


## 2. Single Image Restoration Using Deep Image Priors

### The Deep Image Restoration Paradigm
In the current deep learning paradigm, image restoration by generating the restored image is commonly thought to require training the network parameters over a large dataset. A simple approach within this paradigm is to realize a network as below

$$x = f_{\theta}(x_0)$$

where network parameters, clean image output and corrupted image input are denoted with $\theta$, $x$, $x_0$ respectively. After training with multiple clean-corrupted image pairs, this network may be expected to output clean image $x$ when the corrupted image $x_0$ is presented as input in the test phase. 

The parameters learned by training on multiple images are expected to reverse the corruption by extracting image priors from the corrupted data. Formally, the network should estimate the following posterior;

$$p(x|x_0) \propto p(x_0|x)p(x)$$

where $p(x)$ refers to the image prior. In this particular image restoration task, it refers to the probability of the pixel values formed by the actual content of the image before the corruption, with its features such as edges, surfaces or higher-level content. Consequently, a realistic estimate of the prior is needed for a successful image restoration.

The researchers are critical of this approach, particularly because it is unclear how priors are learned through this type of training and because the network may instead overfit to the training samples. They claim that the information needed to estimate the image prior, or at least a sufficient level of low-level image statistics, can be extracted from the single "test image". Thus, the network can be handcrafted to contain prior information in its parameters without any training on a dataset. The parameters can then be used for outputting a restored image estimate.

### Realistic Image Generation using Encoder/Decoder Networks

In order to implement their proposed method, the researchers utilize a generator network architecture which is used for outputting realistic images from random noise such that;

$$a = f_{\theta}(z)$$

where $z$ is an image formed from random noise and $a$ is the output image. The parameters of the network $f_{\theta}$ determine the nature of the content that will be generated.

### Inverting Images Without Learning Through Energy Minimization

While an encoder-decoder network uses a generative model, image restoration can also be done using a discriminative model. Therefore, another approach the researchers draw on is minimization of a loss without any learning. Let $\alpha$, $x_0$, $\alpha_*$ denote the variable image to be optimized, the corrupted image at hand and the solution of the optimization problem respectively;

$$ \alpha_* = \arg\min_{\alpha} L(\alpha,x_0) $$

However, this minimization leads to the trivial solution $\alpha_*=x_0$. Adding a regularization term leads to a different result.

$$ \alpha_* = \arg\min_{\alpha} L(\alpha,x_0) + R(\alpha) $$

If carefully selected for the task, the regularization term combined with the right choice of loss function can be used for image restoration, to obtain an estimate of the clean image $x$. An example given for $R(\alpha)$ is total variation (TV), which steers the optimization away from noisy outcomes. The regularization term, crafted for the particular task, can also be seen as capturing prior information, albeit generic. The total variation regularizer, which captures smooth surface features for the denoising task, is an example of such a generic prior.
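To make the role of the TV regularizer concrete, the following minimal sketch computes an anisotropic total variation for a smooth and a noisy version of the same synthetic image. The function name `total_variation`, the synthetic gradient image and the noise level are illustrative assumptions and are not taken from the paper or the authors' code.

```python
import numpy as np


def total_variation(img):
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    horizontal = np.abs(img[:, 1:] - img[:, :-1]).sum()
    vertical = np.abs(img[1:, :] - img[:-1, :]).sum()
    return horizontal + vertical


rng = np.random.default_rng(0)

# Synthetic smooth image: a horizontal gradient from 0 to 1 (illustrative only)
smooth = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
# The same image with additive Gaussian noise, clipped back to [0, 1]
noisy = np.clip(smooth + rng.normal(0.0, 0.1, smooth.shape), 0.0, 1.0)

tv_smooth = total_variation(smooth)
tv_noisy = total_variation(noisy)
print("TV smooth:", tv_smooth)
print("TV noisy: ", tv_noisy)
```

Because the noisy image has a much larger TV value, adding $R(\alpha)$ to the loss penalizes noisy candidates and favours smooth ones, which is the generic prior described above.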

### The Deep Image Prior Method

Combining the frameworks introduced above, the researchers introduce the following minimizer for the task of generating the clean image.

$$ \theta_* = \arg\min_{\theta} L(f_{\theta}(z),x_0)$$

where $z$ is the random image input. Compared with the approach in the previous subsection, the optimization is done over the network parameters $\theta$ instead of over the image that is expected to be transformed into the clean image. The corrupted image is still used as the target of the loss minimization, but the regularizer, which could only capture generic prior information, is now replaced by the deep network, whose parametrization captures prior information specific to the image being processed.

Contrary to the approach discussed in the first subsection, this approach involves optimization on a single image rather than training on a dataset, with the aim that the optimized parameters $\theta$ acquire the prior information needed to produce the clean image as such;

$$ x = f_{\theta_{i}}(z)$$

where $\theta_{i}$ denotes the parameters at iteration $i$ of an iterative optimization. The random image $z$ remains the same throughout the optimization.

Note that these "useful" parameters are not the solution $\theta_*$ of the optimization problem, which leads to the corrupted image again. The researchers define this as the overfitting case of the method, and a key aspect of the method is therefore early stopping. This results in $\theta$ settling at a point short of the global optimum, as long as the number of iterations in the optimization process remains low enough not to overfit the data.
 
### Applications of Deep Image Prior

1.   Denoising
![alt text](https://drive.google.com/uc?id=1rfIqKBadkOXz6Aejoxta28Gd4ICfzqEG)
2.   Super Resolution
![alt text](https://drive.google.com/uc?id=1yFP2cQRNhUMiAU4d0jhaR51pY9O8QMHC)
3.   Inpainting
![alt text](https://drive.google.com/uc?id=1QS9AnyuEYWBmgMm9FdHTAKlgs5dxioqF)
4.   Flash - No flash
![alt text](https://drive.google.com/uc?id=1cOhhiH-EeDaISdeWXljci_FDhO6DeNN9)

## 3. Objective and Methods

In this project, we focus on the inpainting task, in which the corrupted image has lost pixels and the task is to generate the complete image. We employ the following loss function, which is used by the researchers;

$$ L(\alpha,x_0) = \| (\alpha-x_0) \circ m \|^2 $$

where $\circ$ is the Hadamard product, i.e. the pixel-wise multiplication of the mask $m$ with the difference between the current network output $\alpha$ and the corrupted image $x_0$. The corrupted image is formed by applying the mask $m$ to the clean image $x$.
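The masked loss can be illustrated with a small NumPy sketch. The helper name `masked_mse` and the 2x2 toy arrays are assumptions made for illustration. The sum is divided by the total number of pixels, which mirrors the way the notebook later applies a mean squared error to `out * mask_var` and `img_var * mask_var`.

```python
import numpy as np


def masked_mse(output, corrupted, mask):
    """Squared error counted only on known pixels (mask == 1), averaged over all pixels."""
    diff = (output - corrupted) * mask
    return float(np.sum(diff ** 2) / output.size)


# Toy 2x2 example (illustrative values only)
clean = np.array([[0.2, 0.4], [0.6, 0.8]])
mask = np.array([[1.0, 0.0], [1.0, 1.0]])  # the top-right pixel is missing
corrupted = clean * mask

guess_corrupted = corrupted.copy()  # leaves the hole black
guess_clean = clean.copy()          # fills the hole correctly
guess_wrong = np.array([[0.5, 0.4], [0.6, 0.8]])  # wrong at a known pixel

loss_corrupted = masked_mse(guess_corrupted, corrupted, mask)
loss_clean = masked_mse(guess_clean, corrupted, mask)
loss_wrong = masked_mse(guess_wrong, corrupted, mask)
print(loss_corrupted, loss_clean, loss_wrong)
```

Both the corrupted and the clean image give zero loss, since the loss never looks at the missing pixel; only an error at a known pixel is penalized (about 0.0225 here). What ends up in the hole is therefore decided by the network parametrization and not by the loss.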

Two types of mask are used in the experiments: 50% Bernoulli masks and text masks. The Bernoulli-masked image is formed by masking 50% of the pixels, selected independently at random.

![Bernouli](https://drive.google.com/uc?id=14TtSCn5YvLj1f4yD3EQxyExWTKnzU6_4)

The text mask is formed by masking the image with letters written on the image. 

![Text over Image](https://drive.google.com/uc?id=1begQmJE4i69fH3pOBk6Za9EfOVfuvB22)

The aim of this project can be summarized as evaluating the claims made by the authors on how this method works, while also attempting to replicate part of the reported results. These claims are;

*   **Erroneous early-stopping leads to undesired results due to the model converging to the optimal, corrupted image.** 
*   **The method gives better results compared to legacy methods employed in [Papyan et al.](https://arxiv.org/pdf/1705.03239.pdf) for the inpainting task with 50% Bernoulli noise.**
*   **The network structure should resonate with the task at hand.**
*   **The network is handcrafted for the particular image as a parametrization of the prior.**
*   **Better inpainting results are obtained with deeper networks.**
*   **Adding skip connections that cause network to work "too well" is detrimental to the performance.**

As will be shown in Experiments 1 and 3, we observed that early stopping is not crucial due to the regularizing nature of the architecture, in contrast to **Claim 1**. **Claim 2** is shown to be correct by successful replication of results close to those given in the table. In line with the other claims, the architecture of the network, mainly its depth and number of skip connections, is also observed to determine the patterns that can be produced by the network, thus impacting the results.

---

## 4. Default Architecture

The architecture employed in our experiments is based on the model specification provided by the researchers, and the existing PyTorch module for generating the network is used. The figure below illustrates the components of the default architecture;

![arch](https://drive.google.com/uc?id=1drExrkoL-MYCpMdh_xq_i4HfY5MXLz6X)

The architecture consists of an encoder-decoder system with the flexibility of having skip connections. The downsampling blocks form the encoder, the upsampling blocks form the decoder.

The encoder extracts vital features preserving the detailed underlying structure of the image, the context and the location of that information, whereas the decoder produces a clean version of the input image by recovering image details as it progresses through its layers in a bottom-up way after the last layer of the encoder. Each skip connection directly transfers information from an encoder layer to its corresponding decoder layer. This is appealing since the corrupted image and the clean output share large parts of the low-level information, such as the location of prominent edges. In effect, skip connections let the network retain different levels of detail that are useful for reconstructing the final output image.

Architecture summary:

* The number of downsampling, upsampling and skip layers should be equal.
* $d_{i}$ - downsampling connection
* $n_{d}[i]$ - number of downsampling filters
* $k_{d}[i]$ - kernel size of the downsampling layer
* $u_{i}$ - upsampling connection
* $n_{u}[i]$ - number of upsampling filters
* $k_{u}[i]$ - kernel size of the upsampling layer
* $s_{i}$ - skip connection
* $n_{s}[i]$ - number of skip filters
* $k_{s}[i]$ - kernel size of the skip layer
* Activation function: 'LeakyReLU|Swish|RELU|none' (we use LeakyReLU)
* Padding: 'zero|reflection' (we use reflection)
* Upsample_mode: 'nearest|bilinear|deconv' (we use 'nearest')
* Downsample_mode: 'stride|avg|max|lanczos2' (we use max pooling)
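As a rough illustration of what the depth of this hour-glass means for a 512x512 input, the sketch below lists the spatial size of the feature maps at each level. It assumes that every downsampling block halves the resolution, which is not stated explicitly in the summary above; the helper name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size, depth):
    """Spatial size after each of `depth` downsampling blocks (assumes a factor of 2 per block)."""
    return [input_size // (2 ** level) for level in range(depth + 1)]


# Default setting in this report: 512x512 images and 5 downsampling levels
sizes = feature_map_sizes(512, 5)
print(sizes)
```

Under this assumption the bottleneck of the default network works on a 16x16 feature map, and the decoder and skip connections restore the finer scales step by step.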



 

---




## 5. Experiments and Results

### Experiment 1: Replication of Table for Bernoulli Mask
### Goal:

The primary goal of this experiment is to replicate the PSNR values reported in Table 1 of the paper, where the method proposed by [Ulyanov et al.](https://dmitryulyanov.github.io/deep_image_prior) is compared to that of [Papyan et al.](https://arxiv.org/pdf/1705.03239.pdf). The secondary goal is to examine how the PSNR values, whose maxima are presented in the table, evolve with iterations, in order to gain insight into convergence to the "desired result" as opposed to the global optimum, i.e. the corrupted image itself.

The Bernoulli mask here refers to the random deletion of a given fraction of pixels. In this experiment, 50% of the pixels are absent in the corrupted images. 

As opposed to the text inpainting or large-hole inpainting tasks, the architecture used for Bernoulli mask inpainting was not available in the supplementary material. Therefore we used the same architecture and input as for the text inpainting task. Nevertheless, performance levels were sufficient to suggest that the two tasks are similar in terms of this architecture's ability to tackle them.

### What Did We Change?

For this experiment, the inpainting notebook provided by the authors is used with minimal changes to the code. These changes include;

*   Modification of the optimization part to allow sequential optimization of multiple images
*   Modification of the optimizer function to record the PSNR values during the iterations into a dict keyed by image name
*   Modification of the results part to show the results from multiple images and to generate the table

The architecture, including the number of layers and skip connections, is kept the same as in the default text inpainting experiment and is initialized by the following segment of code;



```
net = skip(input_depth, torch_to_np(imagedict[key]).shape[0], 
           num_channels_down = [128] * 5,
           num_channels_up =   [128] * 5,
           num_channels_skip =    [128] * 5,  
           filter_size_up = 3, filter_size_down = 3, 
           upsample_mode='nearest', filter_skip_size=1,
           need_sigmoid=True, need_bias=True, pad=pad, act_fun='LeakyReLU').type(dtype)
net = net.type(dtype)
```



### Experimental Setup

The mask used for this experiment is produced by the .... method made available by the authors, which generates mask pixels with a given percentage of occurrence;


```
def get_bernoulli_mask(for_image, zero_fraction=0.95):
    img_mask_np=(np.random.random_sample(size=pil_to_np(for_image).shape) > zero_fraction).astype(int)
    img_mask = np_to_pil(img_mask_np)
    
    return img_mask
```
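The same idea can be tried out without the PIL round-trip using the self-contained sketch below. The helper `bernoulli_mask` and its `seed` argument are illustrative assumptions and are not part of the authors' utilities. Note that the function above defaults to `zero_fraction=0.95`, whereas the experiments in this report call it with 0.5.

```python
import numpy as np


def bernoulli_mask(shape, zero_fraction=0.5, seed=None):
    """Binary mask with roughly `zero_fraction` of the entries set to 0 (masked)."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) > zero_fraction).astype(np.float32)


# One mask value per pixel of a 512x512 grayscale image
mask = bernoulli_mask((512, 512), zero_fraction=0.5, seed=0)
kept = float(mask.mean())
print("fraction of pixels kept:", kept)
```

Passing a 2-D shape yields a single mask per pixel position. If a shape with a channel axis is passed instead, each channel is masked independently, which is the "colored mask" behaviour mentioned in the results for the colored images.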



The imported libraries are given below. Note the importing of the modded utils library, which allows for the image name to be passed to the optimization function, allowing for individual psnr values to be recorded for each image.

```
from __future__ import print_function
import matplotlib.pyplot as plt
%matplotlib inline
import prettytable
import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import numpy as np
from models.resnet import ResNet
from models.unet import UNet
from models.skip import skip
import torch
import torch.optim

from utils.inpainting_utils import *
from utils.common_utils_modded import *
# Newer scikit-image releases provide this as skimage.metrics.peak_signal_noise_ratio
from skimage.measure import compare_psnr

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark =True
dtype = torch.cuda.FloatTensor
```

The dataset is formed by images with dimensions 512x512.

```
PLOT = True
imsize = -1
dim_div_by = 64
shape = [512,512]
```

The dataset, including the images from the table, is parsed below and the dictionaries for storing data during the iterations are initialized. See below for an illustration of the masked and unmasked image pairs.

```
print(os.listdir('data/tabledata4/'))
imagenames = os.listdir('data/tabledata4/')
imagedict = dict()
psnrdict = dict()
maskdict = dict()
for file in os.listdir('data/tabledata4/'):
  impath = 'data/tabledata4/' + file
  img_pil, img_np = get_image(impath, shape)
  img_var = np_to_torch(img_np).type(dtype)
  imagedict[file] = img_var
  mask_pil = get_bernoulli_mask(img_pil,0.5)
  mask_np = pil_to_np(mask_pil)
  mask_var = np_to_torch(mask_np).type(dtype)
  maskdict[file] = mask_var
  psnrdict[file] = [0]
  plot_image_grid([img_np, mask_np, mask_np*img_np], 3,11)
  print(file)
```

![alt text](https://pbs.twimg.com/media/EV04sS0X0AknFSF?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04q-AXkAIe7PU?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04phQWsAAU2T-?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04n15XgAUDQ3z?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04kC-WkAM8nuO?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04iX0WoAUPu7J?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04bGcWsAA9rCC?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04XVJWsAEdWxr?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04UWfX0AAxysd?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04R1hX0AA9jjX?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04lkOXsAMgJLB?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04cjMXQAAwO-L?format=png&name=900x900)
![alt text](https://pbs.twimg.com/media/EV04ZDgXQAIEncQ?format=png&name=900x900)

The neural net follows the same architecture as the default one. For this experiment, the number of iterations is generally kept higher than needed to obtain the final PSNR value, in order to see the path of convergence. The show_every parameter is also repurposed: instead of showing output images, as in its default function, it sets the interval at which PSNR values are computed and recorded.

```
pad = 'reflection' # 'zero'
OPT_OVER = 'net'
OPTIMIZER = 'adam'
i = 0
# Same params and net as in super-resolution and denoising
INPUT = 'noise'
input_depth = 32
LR = 0.1 
num_iter = 12001
param_noise = False
show_every = 100
figsize = 5
reg_noise_std = 0.03

# Loss
mse = torch.nn.MSELoss().type(dtype)
```

The architecture used is given below by the method for initializing it.

```
net = skip(input_depth, img_np.shape[0], 
            num_channels_down = [128] * 5,
            num_channels_up =   [128] * 5,
            num_channels_skip =    [128] * 5,  
            filter_size_up = 3, filter_size_down = 3, 
            upsample_mode='nearest', filter_skip_size=1,
            need_sigmoid=True, need_bias=True, pad=pad, act_fun='LeakyReLU').type(dtype)

net = net.type(dtype)

# Compute number of parameters
s  = sum(np.prod(list(p.size())) for p in net.parameters())
print ('Number of params: %d' % s)
```

A closure method is defined as in the default notebook provided by the authors, with the modification of psnr comparisons and recording of values. This method determines the code to be executed in each step. 

```
def closure(imagename):
    
    global i
  
    net_input = net_input_saved
    if reg_noise_std > 0:
        net_input = net_input_saved + (noise.normal_() * reg_noise_std)
        
        
    out = net(net_input)
    mask_var = maskdict[imagename]
        
    total_loss = mse(out * mask_var, img_var * mask_var)
    total_loss.backward()
        
    print ('Iteration %05d    Loss %f' % (i, total_loss.item()), '\r', end='')
    if  PLOT and i % show_every == 0:
        out_np = torch_to_np(out)

        psnrval = compare_psnr(torch_to_np(img_var),out_np)
        psnrdict.setdefault(imagename).append(psnrval)

    i += 1

    return total_loss
```

The main optimization loop is given below, iterating through images and initializing the neural net for each.

```
for key in imagedict:

  print(key)

  net_input = get_noise(input_depth, INPUT, torch_to_np(imagedict[key]).shape[1:]).type(dtype)
  
  net_input_saved = net_input.detach().clone()
  noise = net_input.detach().clone()

  net = skip(input_depth, torch_to_np(imagedict[key]).shape[0], 
            num_channels_down = [128] * 5,
            num_channels_up =   [128] * 5,
            num_channels_skip =    [128] * 5,  
            filter_size_up = 3, filter_size_down = 3, 
            upsample_mode='nearest', filter_skip_size=1,
            need_sigmoid=True, need_bias=True, pad=pad, act_fun='LeakyReLU').type(dtype)
  net = net.type(dtype)

  img_var = imagedict[key]
  p = get_params(OPT_OVER, net, net_input)
  optimizem(OPTIMIZER, p, closure, LR, num_iter,key)
  i = 0
```


### Results

In this subsection, the parameters used for all images in the dataset are provided for different trials followed by the obtained PSNR graphs.

For this trial, colored images are also used along with their grayscale versions. However, comparisons with these results should be avoided, as the Bernoulli mask method used produces a colored mask as well, making the problem more similar to denoising than inpainting. This is probably because the authors only meant this mask to be used on grayscale images, but we chose to include them anyway.

#### 1st Trial

* Number of Iterations: 6001
* Show_Every: 10 
* Learning Rate: 0.01

The overall results for the first trial are given below. Note that the number of iterations and the learning rate are the same as those used for the text inpainting task by the authors.

![PSNRS](https://pbs.twimg.com/media/EVAJj-ZXQAcEyZp?format=png&name=small)

The steady increase in PSNR is in line with the convergence behaviour described by the authors. As they note in the supplementary material, a destabilization occurs for some images during optimization, captured as the large PSNR drops in this graph. In their trials, these are described as a large, sudden increase in MSE that results in excessively blurry images before the MSE stabilizes again. We are convinced that the drops in PSNR are indeed due to this. Unlike a sharp MSE increase that might lead the output to converge quickly to the noisy, corrupted image, these destabilizations also lower the PSNR computed against the clean image, keeping the parameters away from the desired result. This is confirmed in the 3rd Trial, in which an extreme destabilization produces a fully dark image, and in Experiment 3, for which we were able to obtain the blurred image formed at the bottom of these PSNR drops. 

Furthermore, the curves for all images except two are mostly flat by the last iteration. Thus it is possible to infer that this number of iterations is mostly sufficient for reaching the maximum values. 

The maximum values obtained are as below;

![psnr](https://drive.google.com/uc?id=1m0FEKyzG2Z-VAWElmvEanBwM4b2mPJT4)
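A table of maxima like the one above can be derived from the recorded curves with a few lines of Python. The sketch below is an assumption about how this could be done, not the code that produced the table; `summarize_psnr` is a hypothetical helper, and the curves in `example_curves` are made-up numbers used only to show the mechanics. It relies on the fact that each curve in `psnrdict` starts with a placeholder 0 and then holds one value per `show_every` iterations, starting at iteration 0.

```python
def summarize_psnr(psnr_curves, show_every):
    """Return {image name: (max PSNR, iteration at which it was recorded)}."""
    summary = {}
    for name, curve in psnr_curves.items():
        values = curve[1:]  # drop the placeholder 0 stored at initialization
        best_index = max(range(len(values)), key=values.__getitem__)
        summary[name] = (values[best_index], best_index * show_every)
    return summary


# Made-up curves for illustration only; these are not measured results
example_curves = {
    "image_a": [0, 12.0, 30.5, 38.1, 37.9],
    "image_b": [0, 11.0, 25.0, 24.0, 29.2],
}
summary = summarize_psnr(example_curves, show_every=10)
for name, (best, iteration) in summary.items():
    print(name, best, iteration)
```

Keeping the iteration alongside the maximum also shows how late in the optimization the best value was reached, which is relevant to the early-stopping discussion.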

The maximum values obtained are consistent with those presented in Table 1 of the paper, although there are differences. The house image is again the one with the best PSNR value, and the ranking is mostly preserved. 

Another result is that the colored images, for which the masking technique turned the task into a denoising problem with colored pixels, turned out to have the same convergence pattern. In the case of Lena, the colored and grayscale versions converged very similarly. 

#### 2nd Trial

* Number of Iterations: 10001
* Show_Every: 50
* Learning Rate: 0.01

This trial is done on only two images, man and cameraman, with an increased number of iterations and the same learning rate, to get further information on the convergence rate and to see whether higher results are attainable.

![PSNRS](https://pbs.twimg.com/media/EVBYPDjWAAIjByU?format=png&name=small)

The destabilization is still observed for one image; otherwise, no decrease in PSNR is observed, even though the number of iterations is increased by about 66.7%.

#### 3rd Trial

* Number of Iterations: 12001
* Show_Every: 50
* Learning Rate: 0.1

This trial is done with an increased learning rate and slightly increased number of iterations;

![PSNRS](https://pbs.twimg.com/media/EVBY_NLWoAAIG74?format=png&name=small)

The destabilization in this case seems to be critical, effectively halting the optimization: the neural net starts to produce all-black images after some point. Otherwise, the curves follow the same pattern as those in Trials 1 and 2.

---





### Experiment 2: Data generation code variant and Architecture Modification
### Goal: 

The first part of this experiment was to create a code variant which masks a clean image with random text over it using simple numpy multiplication. The other part is to change the number of filters per layer in the network and to change the depth of the network. This experiment was done on 'Kate.png' using skip network.

### What Did We Change?

* Dataset generation: get_binary_text_mask() lets us decide the location of the random text and the text itself whereas text_inpainting() multiplies the input image with the text mask to generate corrupted image.





```
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def get_binary_text_mask(for_image, sz=20, position=(128, 128), text='hello world'):
    # Draw black text on a white canvas with the same shape as the image
    font_fname = './font/FreeSansBold.ttf'
    font_size = sz
    font = ImageFont.truetype(font_fname, font_size)
    img_mask = Image.fromarray(np.array(for_image)*0+255)
    draw = ImageDraw.Draw(img_mask)
    draw.text(position, text, font=font, fill='rgb(0, 0, 0)')

    # Convert to a float mask: 1.0 where the pixel is kept, 0.0 under the text
    binary_mask_temp = np.array(img_mask)
    binary_mask = np.zeros_like(binary_mask_temp, dtype=np.float32)
    binary_mask += 1.0
    binary_mask[binary_mask_temp<254] -= 1.0
    return binary_mask

def text_inpainting(input_image):
    image_shape = [shape for shape in input_image.shape[:-1]]
    output_image_shape = [1] + image_shape + [32]
    ## get mask
    binary_mask_1 = get_binary_text_mask(input_image, sz=30, position=(128, 128), 
                                         text='Image Corruption')
    binary_mask_2 = get_binary_text_mask(input_image, sz=25, position=(250, 300), 
                                         text='Text Over Image')
    binary_mask_3 = get_binary_text_mask(input_image, sz=25, position=(200, 400), 
                                         text='Deep Learning Project')
    # Multiplying the binary masks keeps a pixel only if no text covers it
    binary_mask = np.multiply(np.multiply(binary_mask_1, binary_mask_2), binary_mask_3)
    corrupted_image = np.multiply(input_image, binary_mask)
    # The excerpt ends here; returning the image and mask is an assumption
    return corrupted_image, binary_mask
```

* Changed the number of filters in downsampling, upsampling and skip connection (from 4 to 128) while keeping the depth of the network same (Depth = 5)

As shown in the code below, `num_filters` (a placeholder name for the number of filters) was set to 4, 16, 32, 64 and 128 in turn to compare the PSNRs.


```
num_filters = 128  # placeholder name; set to 4, 16, 32, 64 or 128
net = skip(input_depth, torch_to_np(imagedict[key]).shape[0], 
           num_channels_down = [num_filters] * 5,
           num_channels_up =   [num_filters] * 5,
           num_channels_skip =    [num_filters] * 5,  
           filter_size_up = 3, filter_size_down = 3, 
           upsample_mode='nearest', filter_skip_size=1,
           need_sigmoid=True, need_bias=True, pad=pad, act_fun='LeakyReLU').type(dtype)
net = net.type(dtype)
```

* Changed the depth of the network with and without skip connections while keeping the number of filters in each layer the same (No. of filters = 128).

As shown in the code below, `depth` is varied from 1 to 8 to see its effect on reconstruction accuracy. To remove the skip connections, set the number of filters in `num_channels_skip` to zero.


```
depth = 5  # varied from 1 to 8 in this experiment
net = skip(input_depth, torch_to_np(imagedict[key]).shape[0], 
           num_channels_down = [128] * depth,
           num_channels_up =   [128] * depth,
           num_channels_skip =    [128] * depth,  
           filter_size_up = 3, filter_size_down = 3, 
           upsample_mode='nearest', filter_skip_size=1,
           need_sigmoid=True, need_bias=True, pad=pad, act_fun='LeakyReLU').type(dtype)
net = net.type(dtype)
```

### Results:

1. Data was successfully corrupted with the new code variant.

![data](https://drive.google.com/uc?id=1lpMBQd1zMehqxW7YN6DZHUtdzUIf2VOd)

2. As shown in the table below, increasing the number of filters per layer increased the PSNR value. This is likely because more filters mean more parameters, which can capture more features of the image. We could confirm that the number of filters used in the original code (128) indeed produced the best result.

![depth](https://drive.google.com/uc?id=105FWKHYAW4IsLewxcD26sxHD6Sj4R4hs)

3. There are 2 scenarios to be considered here. One of them is increasing the depth without skip connection and the other one is increasing it with skip connection.

![skip](https://drive.google.com/uc?id=1oRNeYRI0lmt93D5jN8awZkTW3jeQ-vJH)

As we can see, for shallow networks, skip connections are detrimental. This is because they transfer even the noisy details of the image to the decoder. However, as the network gets deeper, a network without skip connections suffers from the vanishing gradient problem, where the network in effect forgets some nuances of the image to be recovered. This is solved by adding skip connections, which allow the network to remember the various levels of detail that are crucial to reconstruct the final image.

### Experiment 3: Hyperparameter Testing - Optimizers
### Goal
In this part, our aim is to examine the effect of hyperparameter changes. To compare the changes and the resulting performance, `Kate.png` has been selected. In the following sections, first the learning rate (LR) is varied with the same optimizer. Second, the type of optimizer is changed and the results of this change are logged. All graphs have PSNR on the y-axis, as in the previous sections. 
### What Did We Change?
As explained in the goal part, first the learning rate was changed. Then, the optimizer was changed in order to study the early-stopping phenomenon with a different optimizer. The results show how the output image improves with the number of iterations, which lets us identify the number of iterations and the learning rate that give the best performance.
#### *Change on LR*
As explained before, the optimization aims to find an early stopping point at which the image is inpainted. We swept the learning rate from 0.001 to 0.1. The optimizer is *Adam*. 
#### *Change on optimizer*
Originally, *Adam* is selected as the optimizer. However, the authors also provide L-BFGS as another optimizer. L-BFGS is a variant of the quasi-Newton method that works with a limited amount of computer memory. 

To change these in the source code, please refer to Experiment 1.
Moreover, the following piece of code illustrates how the PSNR value is logged. 
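```
import math

import numpy as np

def psnr(original_image, output_of_the_network):
    # Both images are expected as arrays scaled to [0, 1]
    mse = np.mean((original_image - output_of_the_network) ** 2)
    if mse == 0:
        # Identical images: return a large finite value instead of infinity
        return 100
    # Peak value is 1.0, so PSNR = 20 * log10(1 / sqrt(MSE))
    return 20 * math.log10(1.0 / math.sqrt(mse))
```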
### Results
*Optimizer is Adam and LR = 0.1*

![a1](https://raw.githubusercontent.com/yyunon/reproducibility-project-group-71/master/sims/6173bd9f-bc8a-48ac-a4f2-bc3314e6c49b.jpeg?token=AEKNEMW35Q6EVOOQU2YKH3C6TOAJY)

*Optimizer is Adam and LR = 0.01*

![a2](https://raw.githubusercontent.com/yyunon/reproducibility-project-group-71/master/sims/psnr_training_001.jpg?token=AEKNEMWNVETRP4XFRWZQDYK6TOAMI)

*Optimizer is Adam and LR = 0.001*

![a3](https://raw.githubusercontent.com/yyunon/reproducibility-project-group-71/master/sims/psnr_training_0001.png?token=AEKNEMTX7PCSZNJ3DZZWRVS6UM67E)

*Optimizer is L-BFGS and LR = 0.01*

![a4](https://raw.githubusercontent.com/yyunon/reproducibility-project-group-71/master/sims/psnr_training_lfbgs_001.jpg?token=AEKNEMRS54W2QLVHZUKLFZK6TOAPO)


We can conclude that the best results are achieved with `adam` and `LR=0.001` with 6001 iterations. Once we change the optimizer to L-BFGS, convergence is faster but ends at a lower PSNR value. A possible reason is that this optimizer is used for other purposes, such as text.

Looking at the runs with larger learning rates, sudden drops in PSNR appear. The question is what the network outputs during these sudden drops, and why they occur.

*What does output of the network produce at those sudden drops?*

![a2](https://raw.githubusercontent.com/yyunon/reproducibility-project-group-71/master/sims/13aaa7d1-3560-413a-aee9-f4844b4c4de9.jpeg?token=AEKNEMRVU3EI2TS2DOTHU3S6TOBDY)

We can clearly see that the output image during a sudden drop is still related to the original image. This can be understood from the fact that the network reproduces only lower-frequency features at that instant. The phenomenon cannot be attributed to convergence towards the noisy image, as that image lies far outside the region that the optimization can reach.
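
To locate such drops in a logged PSNR curve, one option is to scan for consecutive iterations between which the PSNR falls by more than a chosen margin. The sketch below is a minimal illustration; the helper `find_sudden_drops`, the 3 dB threshold and the sample values are hypothetical and not taken from our logs.

```python
def find_sudden_drops(psnr_history, min_drop_db=3.0):
    """Return (iteration, drop_in_db) pairs where PSNR fell by at least `min_drop_db`.

    `psnr_history[i]` is the PSNR logged at iteration i.
    """
    drops = []
    for i in range(1, len(psnr_history)):
        drop = psnr_history[i - 1] - psnr_history[i]
        if drop >= min_drop_db:
            drops.append((i, drop))
    return drops


# Made-up curve with a single collapse at iteration 4.
history = [10.0, 18.0, 24.0, 27.0, 12.0, 20.0, 26.0, 28.0]
print(find_sudden_drops(history))  # [(4, 15.0)]
```

The returned iteration indices can then be used to save or inspect the network output exactly at those instants.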

---

 

## 8. Discussion

The results obtained from Experiment 1 closely replicate those reported in Table 1 of the paper. In both the original and the replicated results, images with large smooth regions led to better PSNR values.

Furthermore, images that converge to a higher PSNR are observed to reach high values earlier, benefitting from a "faster start" than the others. An example is the house test image, which contains a large area of sky above the roof. Lena, which came second in the paper and third in the replication results, includes large areas that correspond to the woman's skin and the background.

There may be two reasons behind this pattern:

* First, since the error at every pixel contributes equally to the PSNR value, the large smooth regions become a critical determinant of the resulting PSNR. Once those regions are restored, the PSNR of the image can reach high levels without successful results in the remaining regions, which may be semantically more significant despite their smaller size.

* Second, the parametrization in use produces smooth regions much more easily than noisy regions, which is the main motivation behind this method as explained by the authors. An important consequence for this ranking is that the impedance towards noise covers not only the uncorrelated noise that results from random deletion of pixels, but also correlated content that behaves like noise when the smooth regions are taken as the original signal, such as the high-frequency regions in the images that performed worse.
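
The first point can be checked with a small synthetic example. In the sketch below, 90% of an image is restored perfectly and the remaining 10% has a uniform error of 0.1; the image size, the error level and the variable names are assumptions chosen only to keep the arithmetic simple.

```python
import math
import numpy as np


def psnr(reference, estimate):
    """PSNR in dB for images scaled to [0, 1]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 20 * math.log10(1.0 / math.sqrt(mse))


reference = np.zeros((100, 100))
estimate = reference.copy()
estimate[:10, :] += 0.1  # top 10 rows (10% of the pixels) are off by 0.1

whole_image_psnr = psnr(reference, estimate)             # MSE ~ 0.001, about 30 dB
detail_only_psnr = psnr(reference[:10], estimate[:10])   # MSE ~ 0.01, about 20 dB
print(f"whole image: {whole_image_psnr:.1f} dB, detailed region only: {detail_only_psnr:.1f} dB")
```

The perfectly restored region lifts the score by roughly 10 dB, even though the quality of the detailed region is unchanged.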

This property of the parametrization, i.e. the neural network structure, is illustrated in the journal paper by the authors.

![Structure](https://pbs.twimg.com/media/EVWb3tPXgAMNnAQ?format=jpg&name=large)

These images, sampled from independent random noise, illustrate what kind of patterns these networks can produce, giving an idea of how much they "resonate" with a particular pattern due to their structure, as noted by the authors. The architecture used in these trials is the one in the second column from the right. Although it is able to produce higher-frequency patterns than the version without skip connections to its left, which only produces a blurry pattern, patterns above a certain frequency are still unlikely to appear, so the result remains rather blurry. This is in line with the observation that the neural network is more likely to succeed when given images with large smooth regions.

Results from Experiment 2 are also in line with the observation that the depth and the skip connections of the network determine what it can produce, and impact the results accordingly.

The architecture's tendency to produce a particular kind of pattern, which can be sampled from independent noise, also helps explain another result obtained in the trials of Experiments 1 and 3: the nearly monotonic increase of PSNR values, apart from the sudden drops.

Before starting experimentation, we expected PSNR curves that decline some iterations after reaching a maximum, because the optimization is done against the corrupted image. In this scenario, although they pass through the clean-image solution halfway, the parameters would eventually converge to the actual optimum, the corrupted image, making early stopping crucial. This convergence path is illustrated below, in a figure from the journal paper of the authors:

![Convergence](https://pbs.twimg.com/media/EVWiaVOXgAIsHXs?format=jpg&name=large)

However, in Experiments 1 and 3, apart from destabilizations leading to sudden drops, the PSNR curve is seen to improve and then stay steady after some point, without decreasing again even when the number of iterations is doubled or the learning rate is increased. This led us to conclude that although there may be infinitely many images the network can produce, these possibilities are confined to a region of parameter space determined by the architecture, which acts like a regularizer. Thus, when parameters are initialized with small values, the network is simply not able to produce the corrupted image even if it is optimized against it.

This is due to the regularization implicitly imposed by the network, regardless of early stopping. So instead of the graph above, a more accurate picture would show a shell that encapsulates the achievable solutions, with $x_0$ outside this shell and far from $x_{gt}$.

An implication of this is that for some tasks and architectures, **early stopping may be crucial not because we want to prevent the network from converging to $x_0$, but because we want to avoid destabilizations that may occur at the boundary of the regularizer**. This is in line with the results obtained in Trial 3 in particular, for which the destabilization caused an irreversible degradation. The more iterations are run, the more likely such a sudden drop becomes. An increased learning rate is another factor leading to more detrimental destabilizations.
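
One way to guard against such irreversible degradation, besides stopping early, is to keep the last accepted state and discard any update that causes a sharp drop in a quality score. The sketch below is a generic illustration and was not used in our experiments: `optimize_with_rollback`, `step` and `score` are hypothetical names, and the state here is a plain number standing in for the network parameters. In a real inpainting run the ground truth is unavailable, so `score` would have to be a proxy.

```python
import copy


def optimize_with_rollback(state, step, score, num_iter, max_drop=5.0):
    """Run `step` repeatedly, rejecting updates whose score falls by more than `max_drop`.

    state: any copyable object (stands in for the network parameters)
    step(state, i): returns the updated state for iteration i
    score(state): returns a float, higher is better
    """
    last_good = copy.deepcopy(state)
    last_score = score(last_good)
    rollbacks = []
    for i in range(num_iter):
        candidate = step(copy.deepcopy(last_good), i)
        candidate_score = score(candidate)
        if last_score - candidate_score > max_drop:
            rollbacks.append(i)  # sharp drop: keep the previous state instead
            continue
        last_good, last_score = candidate, candidate_score
    return last_good, last_score, rollbacks


# Toy run: the "score" rises by 1 per iteration, except for a collapse at iteration 3.
def toy_step(value, i):
    return value - 20.0 if i == 3 else value + 1.0


final_state, final_score, rollbacks = optimize_with_rollback(
    10.0, toy_step, score=lambda value: value, num_iter=6
)
print(final_state, rollbacks)  # 15.0 [3]
```

With a stochastic optimizer, the step that follows a rollback generally differs from the rejected one, which is what lets the run continue past the unstable point.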

## 9. Conclusion

In conclusion, we have successfully reproduced the results given in Table 1 of [Ulyanov et al.](https://dmitryulyanov.github.io/deep_image_prior), as well as the results for the text-inpainting task, obtaining values similar to those reported.

The regularizing effect of the network itself is clearly observed, and the sudden increases in the optimized MSE reported by the authors are confirmed.

Changes in architecture are observed to lead to changes in performance. Skip connections are confirmed to be detrimental for shallow networks, while in deeper architectures they provide a means for details to be transferred to the output.

---

