# Generating Images from Text

[Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion) and similar models can create stunning images from text descriptions, revolutionizing creative workflows. Unlike classification models that analyze existing images, diffusion models create new images from noise.

Stable Diffusion is like a super-smart magic coloring book: you whisper what picture you want—say, “a purple dinosaur eating ice cream on the moon”—and the coloring book starts with a page covered in silly scribbles (just random dots). Then, almost like an eraser and crayon working together, it keeps gently wiping away the messy dots and adding the right colors and shapes bit by bit, listening to your words the whole time, until the random mess turns into the exact picture you asked for.

In [None]:
# @title Install the Dependencies {"display-mode":"form"}

# Install and import dependencies
!pip install --quiet "diffusers[torch]" transformers accelerate safetensors

import torch
from diffusers import AutoPipelineForText2Image, DPMSolverMultistepScheduler
from google.colab import output

# Set up some basic variables.
device = "cuda" if torch.cuda.is_available() else "cpu"
generator = torch.Generator(device).manual_seed(1337)

In [None]:
# @title Generating an Image

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"

pipeline = AutoPipelineForText2Image.from_pretrained(
  model_id,
  # float16 halves memory on a GPU; fall back to float32 on CPU
  torch_dtype=torch.float16 if device == "cuda" else torch.float32,
  use_safetensors=True,
  low_cpu_mem_usage=True
)

# Swap to a faster scheduler that needs fewer denoising steps
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

# Memory‑saving toggles (≈‑1.5 GB on a T4)
pipeline.enable_attention_slicing()                  # compute attention in chunks
pipeline.enable_vae_slicing()                        # decode a batch one image at a time
if device == "cuda":
    pipeline.enable_model_cpu_offload()              # keep idle sub-models on CPU; handles GPU placement itself
else:
    pipeline.to(device)                              # no GPU, so nothing to offload

prompt = "A pygmy hippo in New York City watching Instagram on an iPhone" # @param {"type":"string","placeholder":"Prompt"}

image = pipeline(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]

output.clear()

image

## Quality versus Speed

Three parameters trade image quality against generation time:

- `num_inference_steps`: typically 10–50. Fewer steps run faster but leave less room to refine detail.
- `guidance_scale`: typically 1–15. Higher values follow your text more faithfully, but the results get stiffer and less varied.
- `width` / `height`: stick to multiples of 64; go larger only if your VRAM's been hitting the gym.
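As a small illustration of the multiples-of-64 advice, the sketch below rounds a requested size down to the nearest multiple of 64. The helper name `snap_to_multiple` is hypothetical and not part of Diffusers; it only shows the arithmetic.

```python
def snap_to_multiple(value: int, multiple: int = 64) -> int:
    """Round value down to the nearest multiple, but never below one multiple."""
    return max(multiple, (value // multiple) * multiple)

width = snap_to_multiple(700)   # 640
height = snap_to_multiple(512)  # 512 (already a multiple of 64)
```

Rounding down rather than up is a deliberate choice here: it never asks for more VRAM than the size you originally had in mind.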

## CUDA or CPU

**TL;DR**: Do we have a GPU or not?

```py
device = "cuda" if torch.cuda.is_available() else "cpu"
```

That one‑liner is a quick “gear check” for PyTorch: it asks, *“Do I have an NVIDIA GPU with CUDA drivers ready?”*—that’s what `torch.cuda.is_available()` returns. If the answer is **yes**, the variable `device` is set to the string `"cuda"`, which tells PyTorch to park tensors and run math on the GPU (turbo‑charged parallel cores). If the answer is **no**—maybe you’re on a laptop with integrated graphics—the expression falls back to `"cpu"`, so everything runs on the regular processor instead. In short, it’s a portable toggle that automatically picks the fastest hardware you actually have, no manual edits required.
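To see the toggle in action, here is a minimal sketch that creates a tensor on whichever device was picked and runs a little math there. The tensor names `x` and `y` are just for illustration.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tensors created with device=... live on that hardware from the start.
x = torch.ones(2, 2, device=device)
y = (x * 3).sum()  # the multiply and the sum both run on `device`

print(x.device.type, y.item())
```

The same code runs unchanged on a Colab GPU runtime and on a CPU-only laptop; only the reported device type differs.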

## What is a generator?

Defining a generator sets up a private dice-roller for PyTorch that always starts at the same spot:

1. `torch.Generator(device)` → creates a random‑number generator that lives on your GPU (or CPU) instead of using the global one.
2. `.manual_seed(1337)` → rewinds that generator to the exact same “first roll” every run.

Pass this `generator` to Diffusers and the noise it injects, and therefore the final image, will be identical as long as everything else is unchanged. Drop the seed, or reuse the same generator across several calls without re-seeding it, and your pictures will start to diverge, because every call draws numbers from the generator and advances its internal state.
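The behaviour can be checked without loading any model. The sketch below uses plain `torch.randn` as a stand-in for the noise Diffusers draws; the helper `sample_noise` is a hypothetical name used only here.

```python
import torch

def sample_noise(seed: int) -> torch.Tensor:
    # A brand-new generator with the same seed always produces the same numbers.
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(4, generator=gen)

same_a = sample_noise(1337)
same_b = sample_noise(1337)
print(torch.equal(same_a, same_b))  # True

# Reusing one generator advances its internal state between draws.
shared = torch.Generator("cpu").manual_seed(1337)
first = torch.randn(4, generator=shared)   # matches same_a
second = torch.randn(4, generator=shared)  # a different set of numbers
```

If you want the same image from a later cell, create a freshly seeded generator for that call instead of passing the one that has already been used.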



## What Are Diffusion Models?

Diffusion models are a class of generative AI models that create images by gradually removing noise from random patterns. Think of it like this:

1. **Forward Process**: Take a clear image and gradually add noise until it becomes pure random noise
2. **Reverse Process**: Train a model to reverse this process - start with noise and gradually remove it to create an image
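The forward process can be sketched in a few lines. This assumes the common DDPM-style formulation, where a noisy sample at step `t` is a weighted mix of the clean image and Gaussian noise; the linear schedule values below are illustrative assumptions, not Stable Diffusion's exact configuration.

```python
import torch

# Assumed linear noise schedule (illustrative values).
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # share of the original signal left at each step

def add_noise(x0: torch.Tensor, t: int, generator: torch.Generator) -> torch.Tensor:
    """Jump straight to step t of the forward (noising) process."""
    noise = torch.randn(x0.shape, generator=generator)
    signal_scale = alphas_cumprod[t].sqrt()
    noise_scale = (1.0 - alphas_cumprod[t]).sqrt()
    return signal_scale * x0 + noise_scale * noise

gen = torch.Generator("cpu").manual_seed(0)
clean = torch.ones(3, 8, 8)                               # stand-in for a clean image
slightly_noisy = add_noise(clean, 10, gen)                # still mostly signal
almost_pure_noise = add_noise(clean, num_steps - 1, gen)  # almost no signal left
```

Training teaches the network to predict the `noise` term from the noisy sample, which is what lets the reverse process subtract it again step by step.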

### Key Components

1. **U-Net**: The neural network that predicts and removes noise
2. **Text Encoder**: Converts your text prompt into embeddings the model understands
3. **Scheduler**: Controls how noise is added/removed over multiple steps
4. **VAE (Variational Autoencoder)**: Compresses images to/from latent space for efficiency

# Prompt Engineering

The quality of your generated images heavily depends on how you write your prompts. Let's talk about some best practices.

### Prompt Structure

A good prompt typically includes:
1. **Subject**: What/who is in the image
2. **Style**: Artistic style, medium, or technique
3. **Quality modifiers**: Terms such as "highly detailed" or "sharp focus" that nudge the model toward cleaner output
4. **Lighting**: How the scene is lit
5. **Composition**: Camera angle, framing
6. **Details**: Specific attributes or characteristics
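One way to keep prompts consistent is to assemble them from these parts. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, and the ordering it uses (subject first, quality modifiers last) is an assumed convention rather than anything the model enforces.

```python
def build_prompt(subject, style="", quality="", lighting="", composition="", details=""):
    """Join the non-empty prompt parts with commas, subject first."""
    parts = [subject, details, style, composition, lighting, quality]
    return ", ".join(part.strip() for part in parts if part.strip())

prompt = build_prompt(
    subject="a pygmy hippo crossing a city street",
    style="cinematic photo",
    quality="highly detailed",
    lighting="dramatic rim lighting",
    composition="low angle, 85 mm lens",
    details="rain-soaked asphalt",
)
print(prompt)
```

Leaving a part empty simply drops it, so you can start with a subject alone and add one component at a time to see what each contributes.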

### Negative Prompts

**Negative prompts** are simply “don’t-do-this” instructions you feed alongside your regular text prompt. In a Stable Diffusion pipeline that uses *classifier-free guidance* (CFG), the model predicts the noise twice at every denoising step: once conditioned on your positive prompt and once on an *unconditioned* prompt (normally an empty string). The two predictions are combined as:

$$
\epsilon_{\text{pred}} = \epsilon_{\text{uncond}} + s \,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})
$$

Here *s* is the guidance scale. If you replace the empty unconditioned prompt with a list of unwanted features (e.g. “blurry, extra fingers, watermark”), the subtraction term now points away from those features, so the model actively steers away from them: the more strongly they would appear, the stronger the push in the opposite direction. In practice, this lets you suppress artifacts (like text or low-res edges), rein in style drift, or remove whole objects without retraining, because the negative prompt injects a *repulsive* semantic direction into every denoising step. This “hijacked uncond branch” explanation matches what practitioners observe: adding “worst quality, jpeg artifacts” to the negative prompt tends to suppress those flaws, and a recent analysis describes a cancellation effect in latent space that removes the specified concepts. (Sources: [Stable Diffusion Art][1], [arXiv][2])

[1]: https://stable-diffusion-art.com/how-negative-prompt-work/ "How does negative prompt work? - Stable Diffusion Art"
[2]: https://arxiv.org/html/2406.02965v1 "Understanding the Impact of Negative Prompts: When and How Do ..."
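A toy calculation makes the guidance formula concrete. The two-element tensors below are made-up stand-ins for real U-Net noise predictions, and `apply_cfg` is a hypothetical helper name; only the arithmetic mirrors the formula above.

```python
import torch

def apply_cfg(uncond: torch.Tensor, cond: torch.Tensor, scale: float) -> torch.Tensor:
    """Combine the two noise predictions: uncond + s * (cond - uncond)."""
    return uncond + scale * (cond - uncond)

cond = torch.tensor([1.0, 0.0])      # prediction with the positive prompt
empty = torch.tensor([0.0, 0.0])     # prediction with an empty prompt
negative = torch.tensor([0.0, 1.0])  # prediction with a negative prompt

print(apply_cfg(empty, cond, 7.0).tolist())     # [7.0, 0.0]
print(apply_cfg(negative, cond, 7.0).tolist())  # [7.0, -6.0]
```

With the empty prompt, the second component stays at zero. With the negative prompt, it is pushed to -6, that is, away from the direction the negative prompt points in.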


In [None]:
# @title Using Negative Prompts {"display-mode":"both"}
model_id = "stabilityai/stable-diffusion-xl-base-1.0" # @param ["stable-diffusion-v1-5/stable-diffusion-v1-5","stabilityai/stable-diffusion-2-1","stabilityai/stable-diffusion-xl-base-1.0","CompVis/stable-diffusion-v1-4","prompthero/openjourney","hakurei/waifu-diffusion","nitrosocke/Ghibli-Diffusion"]


pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True,
).to(device)

# drop‑in scheduler swap (no retraining needed)
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)

pipeline.enable_attention_slicing()   # VRAM saver

# EXEMPLAR (positive) prompt — subject → environment → style → camera/lens → lighting
prompt = (
    "Ultra‑realistic cinematic photo of a pygmy hippo jay‑walking through a neon‑lit "
    "downtown street at dusk, rain‑soaked asphalt reflecting colored lights, 85 mm lens, "
    "f/1.4 bokeh, dramatic rim lighting, shot on Kodak Portra 800, 8‑K resolution"
)

# NEGATIVE prompt — ban common artifacts & unwanted elements
negative_prompt = (
    "blurry, grainy, lowres, overexposed, watermark, text, logo, extra limbs, cars, people, "
    "distorted anatomy, jpeg artifacts"
)

# SD 1.x models are trained at 512×512; SDXL generally works best at 1024×1024
width = 512 # @param {"type":"number"}
height = 512 # @param {"type":"number"}

image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=28,
    guidance_scale=7.0,
    width=width,
    height=height,
    generator=generator,
).images[0]

output.clear()

image

# Explanation

## `DPMSolverMultistepScheduler`

Think of your picture-painting robot (the `pipeline`) as having a little instruction booklet that tells it exactly how to wipe away the foggy noise and reveal the picture. The line `pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)` is like saying, “Hey robot, swap your old booklet for this *new* booklet called `DPMSolverMultistepScheduler`, but copy over all the page numbers so you still know when to start and stop each wipe.” So the robot keeps the same schedule (how many wipes, when they happen) but uses a smarter cleaning style that can make the picture look nicer or appear faster.


## Enable Attention Slicing

Imagine you have a huge coloring book page that's too big to fit on your little table. Instead of trying to color the whole giant page at once, you fold it into smaller squares and color one square at a time, then unfold it to see the full picture: easy and no mess. `pipeline.enable_attention_slicing()` tells the computer to do the same thing with its “brain power” when making an image: it works through the attention computation one small chunk at a time instead of everything all at once, so it doesn't run out of room (memory) and can still finish the whole picture nicely.
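The idea can be sketched outside Diffusers. This is an illustration rather than the library's actual code, and it assumes the chunks are taken over attention heads: each slice is computed separately, so less memory is needed at any one moment, and the results are stitched back together.

```python
import torch

def attention(q, k, v):
    """Plain scaled dot-product attention over (heads, tokens, dim) tensors."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def sliced_attention(q, k, v, slice_size=1):
    """Same math, but only `slice_size` heads are processed at a time."""
    chunks = [
        attention(q[i:i + slice_size], k[i:i + slice_size], v[i:i + slice_size])
        for i in range(0, q.shape[0], slice_size)
    ]
    return torch.cat(chunks, dim=0)

gen = torch.Generator("cpu").manual_seed(0)
q, k, v = (torch.randn(8, 16, 32, generator=gen) for _ in range(3))

full = attention(q, k, v)
sliced = sliced_attention(q, k, v, slice_size=2)
print(torch.allclose(full, sliced, atol=1e-5))
```

The output is the same either way; the trade-off is a bit of extra time for the loop in exchange for a smaller peak memory footprint.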

## Why though?

In short, attention slicing keeps your GPU from running out of memory, while the DPM++ scheduler squeezes more quality (or speed) out of each denoising step. Together they make the pipeline both lighter and sharper without touching the model weights or your prompt.

# Other Models to Experiment With

- `stable-diffusion-v1-5/stable-diffusion-v1-5`: Classic SD 1.5 (fast, versatile)
- `stabilityai/stable-diffusion-2-1`: SD 2.1 (better faces, higher res)
- `stabilityai/stable-diffusion-xl-base-1.0`: SDXL (highest quality, slower)
- `CompVis/stable-diffusion-v1-4`: Original SD 1.4
- `prompthero/openjourney`: Midjourney style
- `hakurei/waifu-diffusion`: Anime style
- `nitrosocke/Ghibli-Diffusion`: Studio Ghibli style