Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION. It is trained on 512x512 images from a subset of the LAION-5B database and uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts. With its 860M-parameter UNet and 123M-parameter text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB of VRAM.
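
As a rough sanity check on the memory figure, the parameter counts above can be converted into a weight footprint. This is back-of-the-envelope arithmetic only: it assumes 4 bytes per parameter in float32 and 2 bytes in float16, and it ignores the image autoencoder, activations, and framework overhead, which is why the practical requirement is well above the raw weight size.

```python
# Back-of-the-envelope weight memory for the two components named above.
# Assumption: 4 bytes/parameter (float32) or 2 bytes/parameter (float16).
UNET_PARAMS = 860_000_000
TEXT_ENCODER_PARAMS = 123_000_000


def weight_gib(num_params, bytes_per_param):
    """Return the size of the weights in GiB."""
    return num_params * bytes_per_param / 2**30


total_params = UNET_PARAMS + TEXT_ENCODER_PARAMS
fp32_gib = weight_gib(total_params, 4)
fp16_gib = weight_gib(total_params, 2)

print(f"float32 weights: {fp32_gib:.2f} GiB")
print(f"float16 weights: {fp16_gib:.2f} GiB")
```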

Setup
Please make sure you are using a GPU runtime to run this notebook, so inference is much faster. If the following command fails, use the Runtime menu above and select Change runtime type.

In [None]:
!nvidia-smi

In [None]:
!pip install diffusers==0.4.0
!pip install transformers scipy ftfy
!pip install "ipywidgets>=7,<8"

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Load the pre-trained weights of the https://huggingface.co/CompVis/stable-diffusion-v1-4 model and move the pipeline to the GPU.

In [None]:
import torch
from diffusers import StableDiffusionPipeline

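# make sure you're logged in with `huggingface-cli login` (or notebook_login() above)
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# move the pipeline to the GPU; the seeded generator used later is created on "cuda"
pipe = pipe.to("cuda")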

In [None]:
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]  # a PIL image (https://pillow.readthedocs.io/en/stable/)

# To keep the image, save it to disk:
image.save("astronaut_rides_horse.png")

# In Google Colab or Jupyter, a bare expression on the last line displays it:
image

Running the above cell multiple times gives a different image every time. For deterministic output, pass a seeded generator to the pipeline: with the same seed, prompt, and settings, you get the same image.

In [None]:
import torch

generator = torch.Generator("cuda").manual_seed(1024)

image = pipe(prompt, generator=generator).images[0]

image
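
The effect of the seed can be checked without running the model. The sketch below draws Gaussian noise from a seeded `torch.Generator`, which is the kind of random input a seeded pipeline call starts from. The helper `sample_noise` and the tensor shape are illustrative assumptions, and the generator lives on the CPU so the cell runs without a GPU.

```python
import torch


def sample_noise(seed, shape=(1, 4, 64, 64)):
    """Draw Gaussian noise from a freshly seeded generator (hypothetical helper)."""
    # A new generator per call; "cpu" is used here so no GPU is required.
    generator = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(shape, generator=generator)


a = sample_noise(1024)
b = sample_noise(1024)  # same seed as a
c = sample_noise(1025)  # different seed

print("same seed, same noise:", torch.equal(a, b))
print("different seed, same noise:", torch.equal(a, c))
```

A generator is stateful: reusing the same generator object for a second draw continues the random stream instead of restarting it, so create or re-seed the generator before each call that should be reproducible.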


To generate multiple images for the same prompt, we simply use a list with the same prompt repeated several times. We'll send the list to the pipeline instead of the string we used before.

In [None]:
from PIL import Image

def image_grid(imgs, rows, cols):
    # expects exactly rows*cols images, all of the same size
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    # fill the grid row by row: image i goes to column i%cols, row i//cols
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
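
To see how the helper lays images out, it can be tried on placeholder tiles before any image is generated. The sketch below repeats the function so it runs on its own; the solid-colour 64x64 tiles are stand-ins chosen for illustration, not pipeline output.

```python
from PIL import Image


def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid


# Six solid-colour 64x64 placeholder tiles stand in for generated images.
colors = ["red", "green", "blue", "yellow", "cyan", "magenta"]
tiles = [Image.new("RGB", (64, 64), color=c) for c in colors]

grid = image_grid(tiles, rows=2, cols=3)
print(grid.size)  # (192, 128): 3 columns of 64 px, 2 rows of 64 px

# The fourth tile (index 3) starts the second row, so it sits at x=0, y=64.
print(grid.getpixel((0, 64)))
```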

In [None]:
num_cols = 3
num_rows = 4

prompt = ["a photograph of an astronaut riding a horse"] * num_cols

all_images = []
for i in range(num_rows):
    images = pipe(prompt).images
    all_images.extend(images)

grid = image_grid(all_images, rows=num_rows, cols=num_cols)
grid

The cell above generates a grid of n × m images: each pipeline call produces one row of num_cols images, and the loop repeats it num_rows times.

Generating captions

In [None]:
from huggingface_hub import notebook_login
from datetime import datetime
import os
import random
import numpy as np
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

In [None]:
notebook_login()

In [None]:
torch.manual_seed(0)
random.seed(0)
np.random.seed(0)

In [None]:
CLASS_NAMES = ['airplane', 'bicycle', 'boat', 'bus',
           'car', 'dog', 'motorcycle', 'person', 'train', 'truck']

prompts = []

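# each line of prompts.txt holds comma-separated prompts; blank entries are skipped
with open('prompts.txt', 'r') as fr:
    for fl in fr:
        prompts += [p.strip() for p in fl.split(',') if p.strip()]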

print(prompts)
n_predictions = 6000
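
The generation loop indexes CLASS_NAMES by prompt position, so it only works if prompts.txt yields exactly one prompt per class, in the same order. The contents of prompts.txt are not shown here, so the sketch below writes a hypothetical file to a temporary directory and uses a shortened class list; `load_prompts` is an illustrative helper, not part of any library.

```python
import os
import tempfile

CLASS_NAMES = ["airplane", "bicycle", "boat"]  # shortened list for the demo


def load_prompts(path):
    """Read comma-separated prompts from every line, dropping blanks."""
    prompts = []
    with open(path, "r") as f:
        for line in f:
            prompts += [p.strip() for p in line.split(",") if p.strip()]
    return prompts


# Hypothetical file contents, written to a temporary file for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompts.txt")
    with open(path, "w") as f:
        f.write("a photo of an airplane, a photo of a bicycle\n\na photo of a boat\n")
    prompts = load_prompts(path)

# Fail early if the prompts and class names cannot be paired one to one.
if len(prompts) != len(CLASS_NAMES):
    raise ValueError(f"expected {len(CLASS_NAMES)} prompts, got {len(prompts)}")

prompt_by_class = dict(zip(CLASS_NAMES, prompts))
print(prompt_by_class)
```

Checking the lengths up front turns a late IndexError, or a silently mislabelled image folder, into an immediate and readable error.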

In [None]:
# NOTE: assumes prompts[k] is the prompt for CLASS_NAMES[k]
# create the per-class output folders up front; image.save() does not create them
for class_name in CLASS_NAMES:
    os.makedirs("images/" + class_name, exist_ok=True)

for step in range(1, n_predictions):
    # distinct loop variables, so the inner loop no longer overwrites the outer counter
    for class_name, prompt in zip(CLASS_NAMES, prompts):

        with autocast("cuda"):
            image = pipe(prompt, height=128, width=128).images[0]

        now = datetime.now()
        # microseconds (%f) keep names unique when several images finish within one second
        timestamp = now.strftime("%Y%m%d_%H%M%S_%f")

        img_name = class_name + "_" + timestamp + ".png"

        image.save("images/" + class_name + "/" + img_name)

    if step % 10 == 0:
        print(str(step) + " completed")
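
Saving into per-class folders only works if the folders exist and the file names never collide. One way to guarantee both is to build each path from the class name, a timestamp, and a running index. The sketch below is a hypothetical helper (`make_image_path` is not part of diffusers or PIL), shown against a temporary directory so it can be run without generating any images.

```python
import os
import tempfile
from datetime import datetime


def make_image_path(root, class_name, index):
    """Create the class folder if needed and return a unique PNG path."""
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # The zero-padded index keeps names unique even within the same second.
    return os.path.join(class_dir, f"{class_name}_{stamp}_{index:05d}.png")


with tempfile.TemporaryDirectory() as root:
    paths = [make_image_path(root, "dog", k) for k in range(3)]
    dirs_exist = all(os.path.isdir(os.path.dirname(p)) for p in paths)
    unique = len(set(paths)) == len(paths)

print("folders created:", dirs_exist)
print("names unique:", unique)
```

In the generation loop, the outer iteration counter could serve as the index, with `image.save(...)` called on the returned path.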