Generating realistic images from a colored sketch using a diffusion model based on a conditional U-Net
I used the COCO dataset for this project. To generate the colored sketch for each image, I do the following steps:
- Convert it to Grey Scale
- Invert the grey scale
- Apply Gaussian Blur
- Invert the pixels
- Get the edges by binary_thresholding
- Replace the edges with the original colors
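The steps above can be sketched roughly as follows. This is a NumPy-only illustration, not the contents of color_to_sketch.py. The dodge blend between the second inversion and the thresholding is an assumption (the list does not say how the blurred image is combined with the grayscale image), and the kernel size, sigma, threshold value, and function names are all hypothetical.

```python
import numpy as np


def gaussian_blur(img, size=21, sigma=5.0):
    """Separable Gaussian blur on a 2D array (size and sigma are hypothetical)."""
    ax = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, size // 2, mode="edge")
    # Blur along rows, then along columns; "valid" restores the original shape.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def color_sketch(rgb, threshold=240):
    """Turn an (H, W, 3) uint8 image into a white canvas with colored edges."""
    rgb = rgb.astype(np.float64)
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # convert to grayscale
    inverted = 255.0 - gray                        # invert the grayscale
    blurred = gaussian_blur(inverted)              # apply Gaussian blur
    inverted_blur = 255.0 - blurred                # invert the pixels
    # Assumption: a dodge blend turns flat regions white and keeps strokes dark.
    dodge = np.clip(gray * 255.0 / np.maximum(inverted_blur, 1.0), 0.0, 255.0)
    edges = dodge < threshold                      # binary thresholding
    sketch = np.full_like(rgb, 255.0)              # white background
    sketch[edges] = rgb[edges]                     # edges take the original colors
    return sketch.astype(np.uint8)
```

Flat regions end up white and pixels near strong intensity changes keep their original color, which gives the colored sketch used as the conditioning input.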
In this model, I used an image size of 128 x 128 and a dataset of 5k images from val2017.
- Noise schedule used: Cosine Noise Schedule
- Optimiser: AdamW, weight decay = 1e-3
- Loss Function: L1
- Activation Function: SILU
- Batch Size: 64
- Normalizing the image tensor to [0,1]
- (concatenating noisy image + sketch) -> Initial Conv (6 -> 64) + timestep embedding added (128 channels)
- DownBlock 1: 128 -> 32 channels
- DownBlock 2: 32 -> 64 channels
- DownBlock 3: 64 -> 128 channels
- ResidualBlock 1: 128 -> 256
- ResidualBlock 2: 256 -> 256
- ResidualBlock 3: 256 -> 128
- UpBlock 1: 128 -> 64 channels (+ skip)
- UpBlock 2: 64 -> 32 channels (+ skip)
- UpBlock 3: 32 -> 16 channels (+ skip)
- (predicted noise image)
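As an illustration of how one training example could be built for this setup, here is a minimal sketch. It assumes the plain cosine schedule has the form signal_rate = cos(t·π/2), noise_rate = sin(t·π/2) with t in [0, 1], and it assumes channels-first arrays. The function names are hypothetical and the actual training code may differ.

```python
import math

import numpy as np


def cosine_schedule(t):
    """Assumed cosine schedule: t=0 is a clean image, t=1 is pure noise."""
    angle = t * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def make_training_pair(image, sketch, t, rng):
    """Build the 6-channel model input and the noise target for one example.

    image and sketch are (3, H, W) float arrays scaled to [0, 1].
    """
    noise = rng.standard_normal(image.shape)
    signal_rate, noise_rate = cosine_schedule(t)
    noisy_image = signal_rate * image + noise_rate * noise
    # Noisy image (3 channels) + sketch (3 channels) -> 6 input channels.
    model_input = np.concatenate([noisy_image, sketch], axis=0)
    return model_input, noise


def l1_loss(predicted_noise, noise):
    """Mean absolute error between predicted and true noise."""
    return float(np.mean(np.abs(predicted_noise - noise)))
```

Because signal_rate² + noise_rate² = 1 at every t, the noisy image keeps roughly unit variance when the data and the noise both have unit variance.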
Starting from pure Gaussian noise, the image is iteratively denoised over a given number of steps. At each step, I compute signal_rate and noise_rate for both the current and the next time step using the cosine schedule. The conditional U-Net takes the noisy image x, the sketch, and the current noise variance as input, and predicts the noise component. From that, a clean image estimate is reconstructed:
predicted_image = (x - noise_rate × predicted_noise) / signal_rate
The noisy image for the next step is then re-composed using the next step's rates:
x_next = signal_rate_next × predicted_image + noise_rate_next × predicted_noise
This process repeats until x converges to a realistic image conditioned on the sketch.
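The loop described above can be sketched as follows. This is a hedged illustration, not the project's sampler: predict_noise stands in for the conditional U-Net, the cosine schedule form is assumed, and start_t is a hypothetical parameter that starts slightly below t = 1 so that signal_rate is never zero in the division.

```python
import math

import numpy as np


def cosine_schedule(t):
    """Assumed cosine schedule: returns (signal_rate, noise_rate) for t in [0, 1]."""
    angle = t * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def reverse_diffusion(predict_noise, sketch, steps=20, start_t=0.98, seed=0):
    """Iteratively denoise from Gaussian noise, conditioned on the sketch.

    predict_noise(x, sketch, noise_variance) is a placeholder for the U-Net.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(sketch.shape)   # start from pure Gaussian noise
    predicted_image = x
    for i in range(steps):
        t = start_t * (1.0 - i / steps)
        t_next = start_t * (1.0 - (i + 1) / steps)
        signal_rate, noise_rate = cosine_schedule(t)
        predicted_noise = predict_noise(x, sketch, noise_rate ** 2)
        # Reconstruct the clean image estimate from the predicted noise.
        predicted_image = (x - noise_rate * predicted_noise) / signal_rate
        # Re-compose the noisy image using the next step's rates.
        signal_rate_next, noise_rate_next = cosine_schedule(t_next)
        x = signal_rate_next * predicted_image + noise_rate_next * predicted_noise
    return predicted_image
```

On the last iteration t_next is 0, so noise_rate_next is 0 and x equals the final image estimate.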
While I did get outputs resembling realistic images, the model stopped getting better at around the 160th epoch. The variance loss and training loss stopped improving much. This was probably because I used only 5k images.
Hence I made a 2nd model with improvements. Below are images from the epochs 160, 170, 180, 190, 200, 210, 220, 230, 240
In this model, I used an image size of 256 x 256 and a dataset of 118k images from train2017.
- Noise schedule used: Offset Cosine Noise Schedule
- Optimiser: AdamW, weight decay = 1e-4 + CosineAnnealingLR as learning rate scheduler (changes the learning rate)
- Loss Function: MSE
- Activation Function: SILU
- Batch Size: 64
- Normalize the image tensors to [-1,1]
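For reference, an offset cosine schedule is commonly written as below. This is a sketch under assumptions: the min_signal_rate and max_signal_rate values are hypothetical defaults taken from commonly used examples, not from this project. The offset keeps signal_rate strictly between 0 and 1, which avoids dividing by a near-zero signal_rate at the start of sampling.

```python
import math


def offset_cosine_schedule(t, min_signal_rate=0.02, max_signal_rate=0.95):
    """Return (signal_rate, noise_rate) for diffusion time t in [0, 1].

    The angle is interpolated between acos(max_signal_rate) at t=0 and
    acos(min_signal_rate) at t=1 instead of between 0 and pi/2.
    """
    start_angle = math.acos(max_signal_rate)
    end_angle = math.acos(min_signal_rate)
    angle = start_angle + t * (end_angle - start_angle)
    return math.cos(angle), math.sin(angle)
```

With these values, signal_rate goes from about 0.95 at t = 0 down to about 0.02 at t = 1, and signal_rate² + noise_rate² stays equal to 1.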
- (concatenating noisy image + sketch) -> Initial Conv (6 -> 64) + timestep embedding added (128 channels)
- DownBlock 1: 128 -> 64 channels
- DownBlock 2: 64 -> 128 channels
- DownBlock 3: 128 -> 256 channels
- ResidualBlock 1: 256 -> 512
- ResidualBlock 2: 512 -> 512
- ResidualBlock 3: 512 -> 256
- UpBlock 1: 256 -> 128 channels (+ skip)
- UpBlock 2: 128 -> 64 channels (+ skip)
- UpBlock 3: 64 -> 32 channels (+ skip)
- (predicted noise image)
Both models generate reasonable outputs with the actual (non-EMA) model but generate random noise with the EMA model. This is probably because the weights change a lot at the beginning of training and I did not run many epochs, so the EMA averages over those early weights and the resulting average cannot be used.
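A toy example shows how slowly an EMA moves away from its starting point. It is only a sketch: the decay of 0.999 and the single-weight setup are hypothetical, since the decay used in this project is not stated here.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy setup: the EMA starts at the initial weight 0.0 while the
# trained weight has already moved to 1.0 and stays there.
ema = [0.0]
trained = [1.0]
for _ in range(100):
    ema = ema_update(ema, trained)

# After 100 updates the EMA has covered less than 10% of the distance
# to the trained weight, so it still mostly reflects the starting value.
print(ema[0])
```

With a high decay, the EMA needs thousands of updates before it tracks the trained weights closely, which is consistent with the EMA model lagging behind after a short training run.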
Demo video: gui.mp4
- Download the checkpoints folder from this drive:
https://drive.google.com/drive/folders/1T-G_GvM5_VO65vhPDhPmF1zeAT_-6b3-?usp=drive_link
- Run the file gui.py using the command:
python3 gui.py
- Click the URL generated
- Download and unzip the dataset:
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
- Run the file color_to_sketch.py using the command below and type 128 or 256, as required:
python3 color_to_sketch.py
- If you generated images of size 128 x 128, run the u_net2 model:
python3 u_net2.py
- If you generated images of size 256 x 256, run the u_net model:
python3 u_net.py