Sketch2Real

Generating realistic images from a colored sketch using a diffusion model based on a conditional U-Net

Extra Criteria: GUI

Getting the Dataset

I used the COCO dataset for this project. To generate the sketch for each image, I do the following steps:

Convert it to Grey Scale
Invert the grey scale
Apply Gaussian Blur
Invert the pixels
Get the edges by binary_thresholding
Replace the edges with the original colors
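
The steps above can be sketched in NumPy roughly as follows. This is only an illustration and not the code in color_to_sketch.py: the function names, kernel size, sigma and threshold are assumptions. A color-dodge division is also assumed between the second inversion and the thresholding, since the list does not say how the edge image is formed from the blurred result.

```python
import numpy as np

def gaussian_kernel(size=21, sigma=5.0):
    """1-D Gaussian kernel normalised to sum to 1 (size and sigma are assumed values)."""
    x = np.arange(size) - size // 2
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Separable Gaussian blur of a 2-D float array, using edge padding."""
    pad = len(kernel) // 2
    padded = np.pad(img, pad, mode="edge")
    # Convolve every row, then every column.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def color_sketch(rgb, threshold=0.9):
    """rgb: H x W x 3 uint8 array. Returns a colored sketch of the same shape."""
    img = rgb.astype(np.float64) / 255.0
    gray = img @ np.array([0.299, 0.587, 0.114])   # 1. grayscale
    inverted = 1.0 - gray                          # 2. invert
    blurred = blur(inverted, gaussian_kernel())    # 3. Gaussian blur
    inv_blur = 1.0 - blurred                       # 4. invert again
    # Assumed step: color-dodge division, close to 1 in flat areas and lower near edges.
    dodge = np.clip(gray / np.maximum(inv_blur, 1e-6), 0.0, 1.0)
    edges = dodge < threshold                      # 5. binary threshold -> edge mask
    sketch = np.ones_like(img)                     # white canvas
    sketch[edges] = img[edges]                     # 6. original colors on the edges
    return (sketch * 255).round().astype(np.uint8)

# Tiny synthetic image: dark left half, bright right half.
demo = np.full((32, 32, 3), 80, dtype=np.uint8)
demo[:, 16:] = 200
demo_sketch = color_sketch(demo)
```

On the synthetic image, flat regions come out white and the darker side of the boundary keeps its original color.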

Model Architecture:

I tested 2 models

Model - 1 (u_net2.py)

In this model, I used 128 x 128 images and a dataset of 5k images from val2017

Noise schedule used: Cosine Noise Schedule
Optimiser: AdamW, weight decay = 1e-3
Loss Function: L1
Activation Function: SiLU
Batch Size: 64
Normalizing the image tensor to [0,1]
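
As a rough illustration of the cosine noise schedule listed above, the sketch below maps a diffusion time t in [0, 1] to a signal rate and a noise rate whose squares sum to 1. The function names are hypothetical, plain Python lists stand in for image tensors, and the forward mixing simply mirrors the re-composition formula used during generation; u_net2.py may differ in its details.

```python
import math

def cosine_schedule(t):
    """Return (signal_rate, noise_rate) for diffusion time t in [0, 1]."""
    angle = t * math.pi / 2
    return math.cos(angle), math.sin(angle)

def add_noise(image, noise, t):
    """Mix a clean image with Gaussian noise at diffusion time t."""
    signal_rate, noise_rate = cosine_schedule(t)
    return [signal_rate * p + noise_rate * n for p, n in zip(image, noise)]
```

At t = 0 the image is returned unchanged, and as t approaches 1 the result approaches pure noise.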

The Conditional U-Net Architecture is as follows:

INPUT

(concatenating noisy image + sketch) -> Initial Conv (6 -> 64) + timestep embedding added (128 channels)

ENCODER (Downsampling)

DownBlock 1: 128 -> 32 channels
DownBlock 2: 32 -> 64 channels
DownBlock 3: 64 -> 128 channels

(Bottleneck)

ResidualBlock1: 128 -> 256
ResidualBlock2: 256 -> 256
ResidualBlock3: 256 -> 128

DECODER (Upsampling)

UpBlock 1: 128 -> 64 channels (+ skip)
UpBlock 2: 64 -> 32 channels (+ skip)
UpBlock 3: 32 -> 16 channels (+ skip)

Final Conv (16 -> 3)

OUTPUT

(predicted noise image)

Image Generation (reverse diffusion)

Starting from pure Gaussian noise, the image is iteratively denoised over a given number of steps. At each step, I compute signal_rate and noise_rate for the current time step, and signal_rate_next and noise_rate_next for the next time step, using the cosine schedule. The Conditional U-Net takes the noisy image x, the sketch, and the current noise variance as input, and predicts the noise component. From that, a clean image estimate is reconstructed:

predicted_image = (x - noise_rate × predicted_noise) / signal_rate

The noisy image for the next step is then re-composed using the next step's rates:

x_next = signal_rate_next × predicted_image + noise_rate_next × predicted_noise

This process repeats until x converges to a realistic image conditioned on the sketch.
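
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in u_net2.py: the function names are hypothetical, plain Python lists stand in for image tensors, predict_noise is a stand-in for the trained U-Net, and the schedule is capped just short of t = 1 so that the division by signal_rate stays well defined.

```python
import math
import random

def cosine_rates(t):
    """(signal_rate, noise_rate) at time t; t is capped at 0.999 (an assumption)."""
    angle = min(t, 0.999) * math.pi / 2
    return math.cos(angle), math.sin(angle)

def reverse_diffusion(predict_noise, sketch, num_pixels, steps=20, seed=0):
    """Deterministic sampling. predict_noise(x, sketch, noise_var) stands in for the U-Net."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]  # start from pure Gaussian noise
    for i in range(steps):
        t, t_next = 1.0 - i / steps, 1.0 - (i + 1) / steps
        signal_rate, noise_rate = cosine_rates(t)
        signal_rate_next, noise_rate_next = cosine_rates(t_next)
        predicted_noise = predict_noise(x, sketch, noise_rate ** 2)
        # Clean-image estimate from the current noisy image.
        predicted_image = [(xi - noise_rate * ni) / signal_rate
                           for xi, ni in zip(x, predicted_noise)]
        # Re-compose the noisy image at the next, lower noise level.
        x = [signal_rate_next * pi + noise_rate_next * ni
             for pi, ni in zip(predicted_image, predicted_noise)]
    return x

# Stand-in predictor that already knows the clean image, used only to exercise the loop.
target = [0.2, -0.4, 0.9]

def oracle(x, sketch, noise_var):
    noise_rate = math.sqrt(noise_var)
    signal_rate = math.sqrt(max(0.0, 1.0 - noise_var))
    return [(xi - signal_rate * ti) / noise_rate for xi, ti in zip(x, target)]

result = reverse_diffusion(oracle, sketch=None, num_pixels=len(target))
```

With a perfect noise predictor the loop returns the clean image; with a trained network, the quality of the result depends on how well the noise is predicted at each step.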

Problems with this model:

While I did get images resembling realistic images, the model stopped getting better at around the 160th epoch. The variance loss and training loss stopped improving much. This was probably because I used only 5k images.

Hence I made a 2nd model with improvements. Below are images from the epochs 160, 170, 180, 190, 200, 210, 220, 230, 240

Model-2 (u_net.py)

In this model, I used 256 x 256 images and a dataset of 118k images from train2017

Noise schedule used: Offset Cosine Noise Schedule
Optimiser: AdamW, weight decay = 1e-4, with CosineAnnealingLR as the learning rate scheduler (lowers the learning rate along a cosine curve during training)
Loss Function: MSE
Activation Function: SiLU
Batch Size: 64
Normalize the image tensors to [-1,1]
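
An offset cosine schedule restricts the angle range of the plain cosine schedule so that the signal rate never reaches exactly 1 or 0. The sketch below shows one common form; the function name and the limits 0.02 and 0.95 are illustrative assumptions, not values taken from u_net.py.

```python
import math

def offset_cosine_schedule(t, min_signal_rate=0.02, max_signal_rate=0.95):
    """(signal_rate, noise_rate) for t in [0, 1], with the angle range clipped at both ends."""
    start_angle = math.acos(max_signal_rate)
    end_angle = math.acos(min_signal_rate)
    angle = start_angle + t * (end_angle - start_angle)
    return math.cos(angle), math.sin(angle)
```

Because the signal rate stays above min_signal_rate, dividing by it during generation remains numerically safe even at the noisiest step.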

The Conditional U-Net Architecture is as follows:

INPUT

(concatenating noisy image + sketch) -> Initial Conv (6 -> 64) + timestep embedding added (128 channels)

ENCODER (Downsampling)

DownBlock 1: 128 -> 64 channels
DownBlock 2: 64 -> 128 channels
DownBlock 3: 128 -> 256 channels

(Bottleneck)

ResidualBlock1: 256 -> 512
ResidualBlock2: 512 -> 512
ResidualBlock3: 512 -> 256

DECODER (Upsampling)

UpBlock 1: 256 -> 128 channels (+ skip)
UpBlock 2: 128 -> 64 channels (+ skip)
UpBlock 3: 64 -> 32 channels (+ skip)

Final Conv (32 -> 3)

OUTPUT

(predicted noise image)

Generated images below are from epochs 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70

Both models generate reasonable outputs using the actual model weights, but generate random noise using the EMA model. This is probably because the weights change a lot in the beginning (I did not run too many epochs), and the EMA averages over those early, rapidly changing weights, producing an average that cannot be used.
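
For reference, an EMA model keeps a running average of the weights that is updated after each training step. The sketch below shows the update rule on a single toy weight. The decay of 0.999, the starting value of 0.0 and the number of updates are assumptions chosen for illustration, and how the EMA is initialised and updated in u_net.py and u_net2.py may differ.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy run: the live weight sits at 1.0 while the EMA copy starts at 0.0.
ema = [0.0]
for step in range(100):
    ema = ema_update(ema, [1.0])
# After 100 updates the EMA has covered 1 - 0.999**100 of the distance to the live weight.
```

With a high decay and few updates, the EMA copy is still dominated by its starting point and the earliest weights, so how usable it is depends on the decay and on how many updates have been made.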

This was the image generated using the EMA model at the 70th epoch

Using the GUI

gui.mp4

I used the larger model for the GUI since it outperforms the smaller model

Download the checkpoints folder from this drive:

https://drive.google.com/drive/folders/1T-G_GvM5_VO65vhPDhPmF1zeAT_-6b3-?usp=drive_link

Run the file gui.py using the command: python3 gui.py
Click the URL generated

Downloading and Generating the dataset

wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip

Run the file color_to_sketch.py using the following command, and type 128 or 256 as required: python3 color_to_sketch.py

Training the model

If you generated images of size 128 x 128: run the u_net2 model: python3 u_net2.py
If you generated images of size 256 x 256: run the u_net model: python3 u_net.py
