# ControlNet with Stable Diffusion

There are other ways to guide the result in terms of the composition and general pose of the image. One of them is a task called depth-to-image, where both a text prompt and a depth image are used to condition the model. This allows you to control the result more precisely than with the "common" image-to-image technique.

The original paper proposed 8 conditioning models, but since then some new ones have appeared. Some examples:

 * Edge detection
  * [Canny Edge](https://github.com/lllyasviel/ControlNet#controlnet-with-canny-edge)
  * [HED Boundary](https://github.com/lllyasviel/ControlNet#controlnet-with-hed-boundary) (holistically-nested edge detection)
 * [Poses](https://github.com/lllyasviel/ControlNet#controlnet-with-human-pose)
 * [Scribbles](https://github.com/lllyasviel/ControlNet#controlnet-with-user-scribbles)
 * [Image segmentation](https://github.com/lllyasviel/ControlNet#controlnet-with-semantic-segmentation)
 * [Depth map](https://github.com/lllyasviel/ControlNet#controlnet-with-depth)
 * Official repository for more examples: https://github.com/lllyasviel/ControlNet

### About the technique

- Paper: [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543), published in February 2023
- ControlNet was developed from the idea that text alone is not enough to solve all problems in image generation.
- First version: https://github.com/lllyasviel/ControlNet#below-is-controlnet-10
- Diagram and additional explanation: https://github.com/lllyasviel/ControlNet#stable-diffusion--controlnet

Paper: https://arxiv.org/pdf/2302.05543.pdf


We are going to implement two ways to condition the model:
 * Edge detection (using Canny Edge)
 * Pose estimation (using Open Pose)

## Installing the libraries

In [8]:
!pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0 torchtext==0.16.0+cpu torchdata==0.7.0 --index-url https://download.pytorch.org/whl/cu121


In [9]:
!pip install diffusers
!pip install -q accelerate transformers xformers


In [10]:
!pip install -q opencv-contrib-python
!pip install -q controlnet_aux
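
Because several of these packages pin specific `torch` versions, it can help to check what actually ended up installed before importing anything. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and the package list is just the set installed in the cells above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Packages installed by the cells above
packages = ["torch", "torchvision", "diffusers", "transformers",
            "accelerate", "xformers", "controlnet_aux"]

for name, version in installed_versions(packages).items():
    print(f"{name:15s} {version or 'not installed'}")
```

If the reported `torch` version differs from the one another package requires, reinstalling that package after `torch` is usually the first thing to try.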

In [3]:
!pip install torch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 



In [5]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
import cv2 #OpenCV
from PIL import Image
import numpy as np


In [6]:
def grid_img(imgs, rows=1, cols=3, scale=1):
  # Arrange PIL images in a rows x cols grid, filled row by row
  assert len(imgs) == rows * cols

  # Every cell uses the (scaled) size of the first image
  w, h = imgs[0].size
  w, h = int(w*scale), int(h*scale)

  grid = Image.new('RGB', size=(cols*w, rows*h))

  for i, img in enumerate(imgs):
      # Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10
      img = img.resize((w,h), Image.LANCZOS)
      grid.paste(img, box=(i%cols*w, i//cols*h))
  return grid
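
The paste position in `grid_img` comes from the image index alone: `i % cols` gives the column and `i // cols` gives the row. The sketch below isolates that arithmetic so it can be inspected without loading any images. The helper name `grid_positions` and the 512x512 cell size are only illustrative.

```python
def grid_positions(n_imgs, cols, w, h):
    """Top-left (x, y) paste position of each image, filling the grid row by row."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_imgs)]


# Six 512x512 cells arranged in 3 columns (so 2 rows)
for i, box in enumerate(grid_positions(n_imgs=6, cols=3, w=512, h=512)):
    print(i, box)
# 0 (0, 0)
# 1 (512, 0)
# 2 (1024, 0)
# 3 (0, 512)
# 4 (512, 512)
# 5 (1024, 512)
```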

## Generating images using edges



### ControlNet model + Canny Edge

- More information about the model: https://huggingface.co/lllyasviel/sd-controlnet-canny


In [7]:
controlnet_canny_model = 'lllyasviel/sd-controlnet-canny'
control_net_canny = ControlNetModel.from_pretrained(controlnet_canny_model, torch_dtype=torch.float16)


In [None]:
pipe = StableDiffusionControlNetPipeline.from_pretrained('runwayml/stable-diffusion-v1-5',
                                                         controlnet=control_net_canny,
                                                         torch_dtype=torch.float16)

In [None]:
from diffusers import UniPCMultistepScheduler
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

In [None]:
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()

In [None]:
pipe.enable_model_cpu_offload()
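
The memory optimizations above depend on optional packages, and `xformers` in particular must match the installed `torch` build. As a hedged sketch, the same three calls can be wrapped so that a failing `xformers` setup is skipped instead of stopping the notebook. The helper name `enable_memory_savers` is hypothetical; the three methods it calls are the ones already used on `pipe` above.

```python
def enable_memory_savers(pipe):
    """Enable the memory optimizations used above; skip xformers if it fails.

    Returns the names of the optimizations that were enabled.
    """
    enabled = []

    pipe.enable_attention_slicing()
    enabled.append("attention_slicing")

    try:
        pipe.enable_xformers_memory_efficient_attention()
        enabled.append("xformers")
    except Exception as err:
        # Assumption: this fails when xformers is missing or was built
        # for a different torch version
        print(f"xformers not enabled: {err}")

    pipe.enable_model_cpu_offload()
    enabled.append("model_cpu_offload")
    return enabled


# Usage with the pipeline created above:
# enable_memory_savers(pipe)
```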

### Loading the image

- Image source: https://unsplash.com/pt-br/fotografias/OjhSUsHUIYM

In [None]:
img = Image.open('./img/bird.jpg')
img

In [None]:
type(img)

### Detecting edges using Canny Edge  

* More about the algorithm: http://justin-liang.com/tutorials/canny/
* More about the implementation in OpenCV: https://docs.opencv.org/3.4/da/d22/tutorial_py_canny.html



In [None]:
def canny_edge(img, low_threshold=100, high_threshold=200):
  # Convert the PIL image to a NumPy array (typically of shape (H, W, 3))
  img = np.array(img)
  # Detect edges; the result is a single-channel array of shape (H, W)
  img = cv2.Canny(img, low_threshold, high_threshold)
  # Add a channel axis: (H, W) -> (H, W, 1)
  img = img[:, :, None]
  # Repeat the channel three times to get an RGB-shaped (H, W, 3) array
  img = np.concatenate([img, img, img], axis=2)
  # Convert back to a PIL image
  canny_img = Image.fromarray(img)
  return canny_img

In [None]:
canny_img = canny_edge(img)
canny_img
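
To get a feeling for what the two thresholds do, here is a simplified sketch of the last stage of the Canny algorithm, hysteresis thresholding, in pure Python. It is not OpenCV's implementation and it skips the earlier stages (smoothing, gradient computation, non-maximum suppression). The input grid stands in for gradient magnitudes and its values are made up for illustration.

```python
from collections import deque


def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) and weak pixels (>= low) connected to them."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[0] * cols for _ in range(rows)]
    queue = deque()

    # Strong pixels are always edges and act as seeds
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                edges[r][c] = 255
                queue.append((r, c))

    # Grow from the seeds into 8-connected weak pixels
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and edges[nr][nc] == 0 and magnitude[nr][nc] >= low:
                    edges[nr][nc] = 255
                    queue.append((nr, nc))
    return edges


gradient = [
    [0, 120, 250, 120, 0],
    [0,   0,   0,   0, 0],
    [0, 150,   0,   0, 0],
]

for row in hysteresis(gradient, low=100, high=200):
    print(row)
# [0, 255, 255, 255, 0]
# [0, 0, 0, 0, 0]
# [0, 0, 0, 0, 0]
```

The weak pixel with value 150 is discarded because it does not touch any strong pixel. Raising `high` to 255 on this toy grid leaves no strong pixels at all, so the output is empty; in general, higher thresholds keep fewer edges.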

In [None]:
prompt = "realistic photo of a blue bird with purple details, high quality, natural light"
neg_prompt = ""

seed = 777
generator = torch.Generator(device="cuda").manual_seed(seed)

imgs = pipe(
    prompt,
    canny_img,
    negative_prompt=neg_prompt,
    generator=generator,
    num_inference_steps=20,
)

imgs.images[0]

In [None]:
prompt = ["realistic photo of a blue bird with purple details, high quality, natural light",
          "realistic photo of a bird in new york during autumn, city in the background",
          "oil painting of a black bird in the desert, realistic, vivid, fantasy, surrealist, best quality, extremely detailed",
          "digital painting of a blue bird in space, stars and galaxy in the background, trending on artstation"]

neg_prompt = ["blurred, lowres, bad anatomy, ugly, worst quality, low quality, monochrome, signature"] * len(prompt)

seed = 777
generator = torch.Generator(device="cuda").manual_seed(seed)

imgs = pipe(
    prompt,
    canny_img,
    negative_prompt=neg_prompt,
    generator=generator,
    num_inference_steps=20,
)

grid_img(imgs.images, 1, len(prompt), scale=0.75)
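
When a list of prompts is passed, the cell above repeats the negative prompt so that there is one entry per prompt. A small helper can make that pairing explicit and catch length mismatches early. This is only a sketch; the name `match_negative_prompts` is hypothetical and not part of diffusers.

```python
def match_negative_prompts(prompts, negative):
    """Return a list with exactly one negative prompt per prompt."""
    if isinstance(prompts, str):
        prompts = [prompts]
    if isinstance(negative, str):
        # Reuse the same negative prompt for every prompt
        return [negative] * len(prompts)
    if len(negative) != len(prompts):
        raise ValueError(
            f"got {len(negative)} negative prompts for {len(prompts)} prompts"
        )
    return list(negative)


prompts = ["photo of a blue bird", "oil painting of a black bird"]
print(match_negative_prompts(prompts, "blurred, lowres"))
# ['blurred, lowres', 'blurred, lowres']
```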

In [None]:
img = Image.open("fox.jpg")

canny_img = canny_edge(img, 200, 255)

grid_img([img, canny_img], 1, 2)

In [None]:
prompt = ["realistic photo of a fox, high quality, natural light, sunset",
          "realistic photo of a fox in the snow, best quality, extremely detailed",
          "oil painting of a fox in the desert, canyons in the background, realistic, vivid, fantasy, surrealist, best quality, extremely detailed",
          "watercolor painting of a fox in space, blue and purple tones, stars and earth in the background"]

neg_prompt = ["blurred, lowres, bad anatomy, ugly, worst quality, low quality, monochrome, signature"] * len(prompt)

seed = 777
generator = torch.Generator(device="cuda").manual_seed(seed)

imgs = pipe(
    prompt,
    canny_img,
    negative_prompt=neg_prompt,
    generator=generator,
    num_inference_steps=20,
)

grid_img(imgs.images, 1, len(prompt), scale=0.75)

### Example using fine-tuned model

- More about the model: https://huggingface.co/sd-dreambooth-library/mr-potato-head




In [None]:
mph = StableDiffusionControlNetPipeline.from_pretrained("sd-dreambooth-library/mr-potato-head", controlnet=control_net_canny, torch_dtype=torch.float16)
mph.scheduler = UniPCMultistepScheduler.from_config(mph.scheduler.config)
mph.enable_model_cpu_offload()
mph.enable_xformers_memory_efficient_attention()

In [None]:
img = Image.open("indiana-jones.jpg")
img

In [None]:
canny_img = canny_edge(img, 100, 255)
canny_img

In [None]:
prompt = "photo of sks mr potato head wearing a black hat, best quality, extremely detailed"
neg_prompt = "ugly, disfigured, distorted face, poorly drawn face, monochrome"
num_imgs = 3

seed = 777
generator = torch.Generator(device="cuda").manual_seed(seed)
mph.safety_checker = None

imgs = mph(
    prompt,
    canny_img,
    negative_prompt=neg_prompt,
    num_images_per_prompt=num_imgs,
    generator=generator,
    num_inference_steps=20,
)

grid_img(imgs.images, 1, num_imgs, 0.5)

## Generating images using poses

- 3D software to create posed images:
  * Magicposer: https://magicposer.com/
  * Posemyart: https://posemy.art/



### Loading the model to extract poses

In [None]:
from controlnet_aux import OpenposeDetector
pose_model = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')

### Loading the image

In [None]:
img_pose = Image.open('./img/pose01.jpg')

In [None]:
pose = pose_model(img_pose)
grid_img([img_pose, pose], rows=1, cols=2, scale=0.75)

### Loading the ControlNet model

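- More about the model: https://huggingface.co/lllyasviel/sd-controlnet-openpose
- Note that the cell below loads `thibaud/controlnet-sd21-openpose-diffusers`, an OpenPose ControlNet paired here with `stabilityai/stable-diffusion-2-1-base`, rather than the model linked above.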


In [None]:
controlnet_pose_model = ControlNetModel.from_pretrained('thibaud/controlnet-sd21-openpose-diffusers', torch_dtype=torch.float16)
sd_controlpose = StableDiffusionControlNetPipeline.from_pretrained('stabilityai/stable-diffusion-2-1-base',
                                                                   controlnet=controlnet_pose_model,
                                                                   torch_dtype=torch.float16)

In [None]:
sd_controlpose.enable_model_cpu_offload()
sd_controlpose.enable_attention_slicing()
sd_controlpose.enable_xformers_memory_efficient_attention()

In [None]:
from diffusers import DEISMultistepScheduler

sd_controlpose.scheduler = DEISMultistepScheduler.from_config(sd_controlpose.scheduler.config)

In [None]:
seed = 777
generator = torch.Generator(device="cuda").manual_seed(seed)
prompt = "professional photo of a young woman in the street, wearing a coat, sharp focus, insanely detailed, photorealistic, sunset, side light"
neg_prompt = "ugly, tiling, closed eyes, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, bad anatomy, watermark, signature, cut off, low contrast, underexposed, overexposed, bad art, beginner, amateur, distorted face"

imgs = sd_controlpose(
    prompt,
    pose,
    negative_prompt=neg_prompt,
    num_images_per_prompt=4,
    generator=generator,
    num_inference_steps=20,
)
grid_img(imgs.images, 1, 4, 0.75)

> Other tests

In [None]:
img_pose = Image.open("./img/pose02.jpg")

pose = pose_model(img_pose)

grid_img([img_pose, pose], 1, 2, scale=0.5)

In [None]:
generator = torch.Generator(device="cuda").manual_seed(seed)

imgs = sd_controlpose(
    prompt,
    pose,
    negative_prompt=neg_prompt,
    num_images_per_prompt=4,
    generator=generator,
    num_inference_steps=20,
)
grid_img(imgs.images, 1, 4, 0.75)


**To improve the results:**

* Test with different schedulers. Euler A is also recommended for ControlNet
* Change the parameters (CFG, steps, etc.)
* Use good negative prompts
* Adjust the prompt so that it matches the initial pose
* It is recommended to provide more information regarding the action. For example, "walking in the street" tends to return better results than "in the street"
* You can use Inpainting to adjust faces that have not been correctly generated
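
One way to try different parameters systematically is to build every combination of CFG scale and step count up front and then loop over them. The sketch below only prepares the keyword arguments. `guidance_scale` and `num_inference_steps` are parameters of the pipeline call, while the helper name `sweep_settings` and the values are just illustrative.

```python
from itertools import product


def sweep_settings(guidance_scales, step_counts):
    """Build one kwargs dict per (CFG scale, number of steps) combination."""
    return [
        {"guidance_scale": g, "num_inference_steps": s}
        for g, s in product(guidance_scales, step_counts)
    ]


settings = sweep_settings([5.0, 7.5, 10.0], [20, 30])
for kwargs in settings:
    print(kwargs)
    # With the pipeline from this notebook, each combination could be run as:
    # imgs = sd_controlpose(prompt, pose, negative_prompt=neg_prompt,
    #                       generator=generator, **kwargs)
```

Re-creating the generator with the same seed before each call keeps the comparison fair, since only the swept parameters change.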

In [None]:
prompt = ["oil painting walter white wearing a suit and black hat and sunglasses, face portrait, in the desert, realistic, vivid",
          "oil painting walter white wearing a jedi brown coat, face portrait, wearing a hood, holding a cup of coffee, in another planet, realistic, vivid",
          "professional photo of walter white wearing a space suit, face portrait, in mars, realistic, vivid",
          "professional photo of walter white in the kitchen, face portrait, realistic, vivid"]

neg_prompt = ["helmet, ugly, tiling, closed eyes, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, bad anatomy, watermark, signature, cut off, low contrast, underexposed, overexposed, bad art, beginner, amateur, distorted face"] * len(prompt)
num_imgs = 1

generator = torch.Generator(device="cuda").manual_seed(seed)
imgs = sd_controlpose(
    prompt,
    pose,
    negative_prompt=neg_prompt,
    generator=generator,
    num_inference_steps=20,
)
grid_img(imgs.images, 1, len(prompt), 0.75)

In [None]:
from diffusers import EulerAncestralDiscreteScheduler

sd_controlpose.scheduler = EulerAncestralDiscreteScheduler.from_config(sd_controlpose.scheduler.config)

In [None]:
generator = torch.Generator(device="cuda").manual_seed(seed)
imgs = sd_controlpose(
    prompt,
    pose,
    negative_prompt=neg_prompt,
    generator=generator,
    num_inference_steps=20,
)
grid_img(imgs.images, 1, len(prompt), 0.75)

> Sitting pose

In [None]:
img_pose = Image.open("./img/bench02_img.jpg")

pose = pose_model(img_pose)

grid_img([img_pose, pose], 1, 2, scale=0.5)

In [None]:
prompt = "professional photo of a young woman sitting in a , wearing a coat, sharp focus, insanely detailed, photorealistic, sunset, side light"
neg_prompt = "ugly, tiling, closed eyes, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, bad anatomy, watermark, signature, cut off, low contrast, underexposed, overexposed, bad art, beginner, amateur, distorted face"

generator = torch.Generator(device="cuda").manual_seed(seed)

imgs = sd_controlpose(
    prompt,
    pose,
    negative_prompt=neg_prompt,
    num_images_per_prompt=4,
    generator=generator,
    num_inference_steps=20,
)
grid_img(imgs.images, 1, 4, 0.75)

You can find the other conditioning models here: https://huggingface.co/lllyasviel?search_models=controlnet