<a href="https://colab.research.google.com/github/sagiodev/stablediffusion_webui/blob/master/Stable_Diffusion_tokenizer_and_embedding_SDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Diffusion v1 tokenizer and embedding

This notebook examines tokens and embeddings used in Stable Diffusion v1.

Tutorials, prompts and resources at https://stable-diffusion-art.com

Modified from the [Interacting with CLIP notebook](https://colab.research.google.com/github/openai/clip/blob/master/notebooks/Interacting_with_CLIP.ipynb).

# Setup


In [1]:
! pip install ftfy regex tqdm
! pip install git+https://github.com/openai/CLIP.git

import numpy as np
import torch
print("Torch version:", torch.__version__)

import clip
print('Available models:')
print(clip.available_models())

model, preprocess = clip.load("ViT-L/14")  # CLIP model whose text encoder Stable Diffusion v1 uses
model.cuda().eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 1.9 MB/s
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-d9d9s9wi
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-d9d9s9wi
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369408 sha256=83353fc7530428ea6d36bc33364ac60757bb68cca0cf06956739efe2ef22ca6f
  Stored in directory: /tmp/pip-ephem-wheel-cache-4j8wdad4/wheels/ab/4f/3a/5e51521b55997aa6f0690e095c08824219753128ce8d9969a3
Successfully

100%|███████████████████████████████████████| 890M/890M [00:25<00:00, 37.0MiB/s]


Model parameters: 427,616,513
Input resolution: 224
Context length: 77
Vocab size: 49408


# Token and embedding
Modify `prompt` in the cell below to see how the text is tokenized and embedded.

In [18]:
# Modify the prompt to inspect its tokens
prompt = "Photo of a cat"

tokens = clip.tokenize(prompt)
with torch.no_grad():
    embeddings = model.encode_text(tokens.cuda()).float()
print("text tokens:")
print(tokens)
print("text tokens size:", tokens.shape)
print("Embeddings size:", embeddings.shape)

text tokens:
tensor([[49406,  1125,   539,   320,  2368, 49407,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
text tokens size: torch.Size([1, 77])
Embeddings size: torch.Size([1, 768])
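
The token row above follows a fixed layout: a start-of-text ID (49406), one ID per prompt token, an end-of-text ID (49407), then zeros up to the context length of 77. The sketch below rebuilds that row in plain Python to make the layout explicit. It is an illustration only, not how `clip.tokenize` is implemented: `pad_to_context` is a hypothetical helper, the word IDs are copied from the output above rather than computed, and raising `ValueError` on an over-long prompt is an assumption of this sketch.

```python
# Illustrative sketch: reproduces the layout of the token row printed above.
SOT_ID = 49406        # start-of-text ID, as seen in the output above
EOT_ID = 49407        # end-of-text ID, as seen in the output above
CONTEXT_LENGTH = 77   # matches the "Context length" printed during setup


def pad_to_context(word_ids, context_length=CONTEXT_LENGTH):
    """Hypothetical helper: wrap token IDs with start/end markers and zero-pad."""
    ids = [SOT_ID] + list(word_ids) + [EOT_ID]
    if len(ids) > context_length:
        # Assumption of this sketch: reject prompts that do not fit.
        raise ValueError(f"prompt needs {len(ids)} slots, only {context_length} available")
    return ids + [0] * (context_length - len(ids))


# IDs for "Photo of a cat", copied from the output above.
row = pad_to_context([1125, 539, 320, 2368])

print(row[:8])
print("row length:", len(row))
# Count used slots by locating the end-of-text marker instead of counting
# non-zero entries, since 0 is only the padding value in this layout.
print("slots used:", row.index(EOT_ID) + 1)
```

For this prompt, six of the 77 slots are in use (start marker, four word tokens, end marker), which leaves 71 slots of padding.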
