# UFormer: A General U-Shaped Transformer for Image Restoration #
## Project summary ##

### 1. Introduction ###
The Uformer architecture is based on U-Net, with Transformer blocks in place of the residual blocks.
Uformer has two core designs. The first is the locally-enhanced window (LeWin) Transformer block, which performs non-overlapping window-based self-attention instead of global self-attention. This significantly reduces the computational complexity on high-resolution feature maps while still capturing local context.
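
To make the complexity argument concrete, here is a small back-of-the-envelope sketch. It only counts the leading term of the attention cost (roughly `(H*W)^2 * C` for global attention versus `M^2 * H*W * C` for windows of size `M`); the function names and the example sizes are assumptions chosen for illustration.

```python
def global_attention_cost(h: int, w: int, c: int) -> int:
    """Leading cost term when every token attends to every other token."""
    n = h * w
    return n * n * c


def window_attention_cost(h: int, w: int, c: int, m: int) -> int:
    """Leading cost term when tokens only attend inside their m x m window."""
    num_windows = (h // m) * (w // m)
    tokens_per_window = m * m
    return num_windows * tokens_per_window * tokens_per_window * c


# Example sizes (assumed): a 256x256 feature map, 32 channels, 16x16 windows.
h, w, c, m = 256, 256, 32, 16
ratio = global_attention_cost(h, w, c) / window_attention_cost(h, w, c, m)
print(ratio)  # H*W / M^2 = 65536 / 256 = 256.0
```

With these sizes the window-based variant is about 256 times cheaper in this simplified count, and the gap grows with the resolution of the feature map.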

The second is a learnable multi-scale restoration modulator, a multi-scale spatial bias that adjusts features in multiple layers of the Uformer decoder. The modulator improves the restoration of details across various image restoration tasks while adding only marginal extra parameters and computational cost.
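
As a rough illustration of the modulator idea, the sketch below adds a learnable bias, shared by all windows, to the tokens of each window. The class name `WindowModulator`, the zero initialization and the tensor sizes are assumptions made for this example; they are not taken from this project's code.

```python
import torch
from torch import nn


class WindowModulator(nn.Module):
    """Learnable spatial bias shared by all windows of one stage (illustrative)."""

    def __init__(self, window_size: int, dim: int):
        super().__init__()
        # One bias vector per position inside a window.
        # Zero init (an assumption) makes the modulator a no-op at the start.
        self.bias = nn.Parameter(torch.zeros(window_size * window_size, dim))

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (batch * num_windows, window_size * window_size, dim)
        return windows + self.bias


modulator = WindowModulator(window_size=16, dim=32)
tokens = torch.randn(8, 16 * 16, 32)
out = modulator(tokens)
num_params = sum(p.numel() for p in modulator.parameters())
```

The extra parameter count is only `window_size * window_size * dim` per stage, which is why the cost of such a bias stays marginal.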

Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration.

### 2. Method ###
The model presented in this project is trained only for the deblurring task.

The implementation follows the description found in the paper. The input image passes through a first convolutional layer, which extracts the initial features and maps the number of channels to the embedding dimension.
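
A minimal sketch of such an input projection is shown here. The embedding size of 32, the 3x3 kernel and the padding of 1 are assumptions for illustration and may differ from the values used in this project.

```python
import torch
from torch import nn

embed_dim = 32  # assumed embedding size
# Map the 3 RGB channels to embed_dim channels; padding=1 keeps H and W unchanged.
input_proj = nn.Conv2d(3, embed_dim, kernel_size=3, padding=1)

image = torch.randn(1, 3, 128, 128)
features = input_proj(image)
print(features.shape)  # torch.Size([1, 32, 128, 128])
```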

The core of the LeWin block is the Leff block, which is composed of a multi-head attention layer and an inverted bottleneck with a GELU activation. Normalization layers are placed before and after the multi-head attention layer, as described in this recent [article](https://arxiv.org/pdf/2002.04745.pdf).
The multi-head attention layer used to build this architecture is the one provided by the PyTorch library.
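
The structure described above could look roughly like the following sketch. It assumes `torch.nn.MultiheadAttention` as the attention layer, residual connections around both sub-layers, and an expansion factor of 4 in the bottleneck; the class name `AttentionBlock` is hypothetical and the actual block in this project may differ.

```python
import torch
from torch import nn


class AttentionBlock(nn.Module):
    """Norm -> attention -> norm -> inverted bottleneck with GELU (illustrative)."""

    def __init__(self, dim: int, num_heads: int, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Inverted bottleneck: expand the channels, apply GELU, project back.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. the tokens of one window per batch entry
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y, need_weights=False)
        x = x + attn_out                 # residual connection (assumed)
        x = x + self.mlp(self.norm2(x))  # residual connection (assumed)
        return x


block = AttentionBlock(dim=32, num_heads=4)
window_tokens = torch.randn(2, 16 * 16, 32)
block_out = block(window_tokens)
```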

The LeWin block comes after the patching function, which divides the feature maps into windows of 16x16 patch size, and is followed by a depatching function that undoes the patching process, as explained in the paper.
Each encoder block contains a number of LeWin blocks that depends on the hardware used to train the model. In my case this number differs from the paper, but the number of encoder blocks is the same.
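
The patching and depatching steps can be sketched as a pair of reshape operations. The function names `window_partition` and `window_reverse` and the `(B, C, H, W)` layout are assumptions for this example; the sketch also assumes that `H` and `W` are multiples of the window size.

```python
import torch


def window_partition(x: torch.Tensor, m: int) -> torch.Tensor:
    """(B, C, H, W) -> (B * num_windows, m * m, C)."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // m, m, w // m, m)
    x = x.permute(0, 2, 4, 3, 5, 1)  # (B, H/m, W/m, m, m, C)
    return x.reshape(-1, m * m, c)


def window_reverse(windows: torch.Tensor, m: int, h: int, w: int) -> torch.Tensor:
    """(B * num_windows, m * m, C) -> (B, C, H, W)."""
    c = windows.shape[-1]
    b = windows.shape[0] // ((h // m) * (w // m))
    x = windows.reshape(b, h // m, w // m, m, m, c)
    x = x.permute(0, 5, 1, 3, 2, 4)  # (B, C, H/m, m, W/m, m)
    return x.reshape(b, c, h, w)


feature_map = torch.randn(2, 8, 64, 64)
windows = window_partition(feature_map, 16)      # 2 images * 16 windows each
restored = window_reverse(windows, 16, 64, 64)   # same as feature_map
```

Each row of `windows` can be fed to the attention layer as an independent sequence of 256 tokens, which is what keeps the attention local.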

Given the U-shaped architecture, the number of decoder blocks equals the number of encoder blocks.
Each encoder block is followed by a 4x4 downsampling convolutional layer with stride 2, and each decoder block is followed by a 4x4 transposed convolutional layer with stride 2.
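
The effect of these two layers on the tensor shapes can be checked with a short sketch. The padding of 1 and the doubling or halving of the channels are assumptions for illustration; only the 4x4 kernel and the stride of 2 come from the description above.

```python
import torch
from torch import nn

dim = 32  # assumed number of channels at this stage
# Halve the spatial size and double the channels (channel change is assumed).
downsample = nn.Conv2d(dim, dim * 2, kernel_size=4, stride=2, padding=1)
# Double the spatial size and halve the channels (channel change is assumed).
upsample = nn.ConvTranspose2d(dim * 2, dim, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, dim, 64, 64)
down = downsample(x)  # (1, 64, 32, 32)
up = upsample(down)   # (1, 32, 64, 64)
```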

The last layer is a basic 3x3 convolutional layer that generates the float32 image, which is compared with the target to compute the loss. As in the paper, the loss function is the Charbonnier loss; its details are explained [here](https://arxiv.org/pdf/1701.03077.pdf).
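
A minimal sketch of a Charbonnier loss follows. It uses the common per-pixel form `sqrt(diff^2 + eps^2)` averaged over all elements, with `eps = 1e-3` as an assumed default; the exact reduction used in this project may differ.

```python
import torch


def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Smooth approximation of the L1 loss: mean of sqrt(diff^2 + eps^2)."""
    diff = pred - target
    return torch.sqrt(diff * diff + eps * eps).mean()


target_img = torch.zeros(1, 3, 8, 8)
loss_same = charbonnier_loss(target_img, target_img)      # close to eps
loss_off = charbonnier_loss(target_img + 1.0, target_img)  # close to 1
```

Unlike a plain L1 loss, this form stays differentiable when the prediction equals the target.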

The metric used to evaluate the model is the PSNR (peak signal-to-noise ratio).
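
For reference, PSNR can be computed from the mean squared error as in the sketch below. It assumes images scaled to the range `[0, 1]` (so `max_val = 1.0`); the function name `psnr` is chosen for this example only.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """PSNR in dB: 10 * log10(max_val^2 / MSE). Returns inf for identical images."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


clean = torch.zeros(1, 3, 8, 8)
noisy = torch.full((1, 3, 8, 8), 0.1)
value = psnr(noisy, clean)  # MSE = 0.01, so about 20 dB
```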


### 3. A little guide to do inference ###
To use the model, follow these steps:
- install the requirements
- download the model weights file
- run ```python3 test.py --model <path_of_weights> --input <dir_that_contains_input_images>```

That's it.

Alternatively, use this notebook, which is ready to use.

```
%pip install -r requirements.txt
```

```python
# system imports
import glob
# third-party imports
import torch
# project imports
from model import UFormer
from test import test

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# default location of the input images
img_filenames = glob.glob("test/*.png")
# default path of the model weights
ckpt_path = "checkpoints/best_model.pt"
torch.cuda.empty_cache()
model = UFormer()
# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load(ckpt_path, map_location=device), strict=False)
model.to(device)
model.eval()
# run inference on each file in img_filenames
for fn in img_filenames:
    test(model, fn)
```

### 4. Results of evaluation ###
The results of the trained model are limited by the hardware available for training.
The graphs below show the PSNR and the validation loss during the training phase:
![PSNR](.docs/PSNR.png)
![Valid](.docs/valid_loss.png)