<a href="https://colab.research.google.com/github/roberttwomey/machine-imagination/blob/main/generate_from_stored.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BigGAN + CLIP + CMA-ES: Interpolation

This notebook generates latent interpolations between different images produced by BigGAN+CLIP+CMA-ES. Given a series of class and noise vectors (each a point in "latent space"), we will generate intermediate images and save the results as a video. 
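
As a rough illustration of what a latent walk is, the sketch below plans a looping walk through a short list of points using only the Python standard library. It uses plain linear interpolation for brevity, whereas step 4 of this notebook uses spherical interpolation. The names `lerp` and `plan_walk` and the toy 2-D points are hypothetical, while `num_steps` and `len_hold` mirror the parameters set in step 4.

```python
def lerp(t, a, b):
    """Linear interpolation between two equal-length vectors, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def plan_walk(points, num_steps, len_hold):
    """Return one latent vector per video frame for a looping walk.

    Each segment holds on its starting point for len_hold frames, then
    moves to the next point over num_steps frames. The last point wraps
    around to the first, so the video loops.
    """
    frames = []
    for i, start in enumerate(points):
        end = points[(i + 1) % len(points)]
        frames.extend(list(start) for _ in range(len_hold))
        for s in range(num_steps):
            t = s / (num_steps - 1) if num_steps > 1 else 0.0
            frames.append(lerp(t, start, end))
    return frames


# Toy 2-D "latent" points standing in for real noise/class vectors.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = plan_walk(points, num_steps=90, len_hold=30)

fps = 30
print(len(frames), "frames,", len(frames) / fps, "seconds at", fps, "fps")
```

With three points, 90 steps, and a 30-frame hold, this gives 3 x (30 + 90) = 360 frames, or 12 seconds of video at 30 fps.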

---

This notebook is based on [j.mp/wanderclip](https://j.mp/wanderclip) by Eyal Gruss [@eyaler](https://twitter.com/eyaler) [eyalgruss.com](https://eyalgruss.com).

I've modified it to store noise/class vectors for reuse, and adapted it to run on [nautilus.optiputer.net](https://nautilus.optiputer.net)/z8 (not relevant here, since we are running on Colab, which is free). robert.twomey@gmail.com

In [None]:
#@title 1. Setup software libraries (run once)
#@markdown This cell installs the software libraries necessary to run our 
#@markdown text-to-image code on this Colab instance: CUDA, torch, torchvision.

#@markdown Run this cell once (press the play button at top left). 

#@markdown (this takes around 4-5 minutes to run)

#@markdown Afterwards, restart the kernel. Select __Runtime -> Restart runtime__
#@markdown from the top menu.

#@markdown Move on to Step 2 once you have restarted.

!pip install ipython-autotime
%load_ext autotime

# prints out what graphics card we have
!nvidia-smi -L

import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html ftfy regex

In [None]:
#@title 2. Install ML Models
#@markdown Installs BigGAN — the image generator network. That is all we need
#@markdown to create our latent walks. Everything else is already in colab.

#@markdown (this takes around 1 minute to run)

# BigGAN
!pip install pytorch-pretrained-biggan

from IPython.display import HTML, clear_output
from PIL import Image
from IPython.display import Image as JupImage
import numpy as np
import nltk
from scipy.stats import truncnorm

# from biggan
import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_names, truncated_noise_sample,
                                       save_as_images, convert_to_images) #, display_in_terminal)
import logging
logging.basicConfig(level=logging.WARNING)

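# WordNet is used by one_hot_from_names to look up class names; this notebook
# loads stored vectors instead, so the download is kept only as a precaution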
nltk.download('wordnet')

# load biggan
model = BigGAN.from_pretrained('biggan-deep-512')
print("loaded bigGAN")

In [None]:
#@title 3. Upload your stored class and noise vectors

#@markdown Click on "Choose Files" below, and select your ...noise.txt and
#@markdown ...class.txt files from before. (For instance "a sunrise through a 
#@markdown window_1_class.txt", "a sunrise through a window_1_noise.txt")

#@markdown Upload as many pairs of files as you would like. We will generate 
#@markdown your latent interpolation ("latent walk") from these points in image 
#@markdown space.

from google.colab import files

uploaded = files.upload()

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))

In [None]:
# Set your prompts and their order here. Copy each filename stem from the
# uploaded files above, leaving off the "_class.txt" / "_noise.txt" suffix.
# The order matters, and you can repeat a prompt.

prompts = [
    "a sunrise through a window_1",
    "a dog sitting on a couch_1", 
    "a cat in a refrigerator_255"
]

In [None]:
#@title 4. Generate a latent walk!

#@markdown This cell takes each of the points in latent space (coordinates for
#@markdown images) and interpolates between them to create a smoothly flowing
#@markdown traversal ("walk") through the space of possible images. 

#@markdown Set the following parameters to shape your output movie. fps is the
#@markdown frames per second of the output film, num_steps is how many frames
#@markdown to generate between each successive phrase/image, and len_hold is how
#@markdown many frames to pause on each resultant image.

# the movie
fps = 30 #@param {type: 'number'}

# the interpolation
num_steps = 90 #@param {type:'number'}
len_hold = 30 #@param {type: 'number'}

truncation = 1.0

interpbase = '/content/'
moviefilename = 'interpolation_%s.mp4'

import glob
import os

import numpy as np
from numpy import arccos, clip, dot, sin
from numpy.linalg import norm

# from
# https://discuss.pytorch.org/t/help-regarding-slerp-function-for-generative-model-sampling/32475/4

# spherical linear interpolation (slerp)
def slerp(val, low, high):
    omega = arccos(clip(dot(low/norm(low), high/norm(high)), -1, 1))
    so = sin(omega)
    if so == 0:
        # L'Hopital's rule/LERP
        return (1.0-val) * low + val * high
    return sin((1.0-val)*omega) / so * low + sin(val*omega) / so * high
 
# uniform interpolation between two points in latent space
def interpolate_points(p1, p2, n_steps=10):
    # interpolate ratios between the points
    ratios = np.linspace(0, 1, num=n_steps)
    # spherically interpolate (slerp) between the two vectors at each ratio
    vectors = list()
    for ratio in ratios:
        v = slerp(ratio, p1, p2)
        vectors.append(v)
    return np.asarray(vectors)

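def get_class_file(path, prompt):
    # find the stored class-vector file(s) whose name starts with this prompt
    print(path+'%s*_class.txt'%prompt)
    result = glob.glob(path+'%s*_class.txt'%prompt)
    return result

def get_noise_file(path, prompt):
    # find the stored noise-vector file(s) whose name starts with this prompt
    print(path+'%s*_noise.txt'%prompt)
    result = glob.glob(path+'%s*_noise.txt'%prompt)
    return result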

class_filenames = [get_class_file('/content/', prompt)[0] for prompt in prompts]
noise_filenames = [get_noise_file('/content/', prompt)[0] for prompt in prompts]

# print(class_filenames, noise_filenames)

class_inputs = [np.loadtxt(filename) for filename in class_filenames]
noise_inputs = [np.loadtxt(filename) for filename in noise_filenames]

count = 0

# loop over inputs

for i in range(len(class_inputs)):

    # generate interpolations
    noises = interpolate_points(noise_inputs[i], noise_inputs[(i+1)%len(class_inputs)], num_steps)
    classes = interpolate_points(class_inputs[i], class_inputs[(i+1)%len(class_inputs)], num_steps)

    # generate images in batches
    batch_size = 10 # 50
    for j in range(0, num_steps, batch_size):
        clear_output()
        print(i, j, count)
        noise_vector = noises[j:j+batch_size]
        class_vector = classes[j:j+batch_size]

        # convert to tensors
        noise_vector = torch.tensor(noise_vector, dtype=torch.float32)
        class_vector = torch.tensor(class_vector, dtype=torch.float32)

        # put everything on cuda (GPU)
        noise_vector = noise_vector.to('cuda')
        noise_vector = noise_vector.clamp(-2*truncation, 2*truncation)
        class_vector = class_vector.to('cuda')
        class_vector = class_vector.softmax(dim=-1)
        model.to('cuda')

        # generate images
        with torch.no_grad():
            print(noise_vector.shape)
            print(class_vector.shape)
            output = model(noise_vector, class_vector, truncation)

        # move the generated images back to the CPU
        output = output.to('cpu')

        imgs = convert_to_images(output)

        # repeat first image
        
        if j == 0:
            for k in range(len_hold):
                imgs[0].save(interpbase+"/output_%05d.png" % count)
                count = count + 1
                
        for img in imgs: 
            img.save(interpbase+"/output_%05d.png" % count)
            count = count + 1

# generate mp4
out = moviefilename%fps
with open('list.txt','w') as f:
  for i in range(count):
    print('file %s/output_%05d.png\n'%(interpbase, i))
    f.write('file %s/output_%05d.png\n'%(interpbase, i))
!ffmpeg -r $fps -f concat -safe 0 -i list.txt -c:v libx264 -pix_fmt yuv420p -profile:v baseline -movflags +faststart -r $fps $out -y

from base64 import b64encode

# embed the finished movie in the notebook as a base64 data URL
with open(moviefilename%fps, 'rb') as f:
  data_url = "data:video/mp4;base64," + b64encode(f.read()).decode()
display(HTML("""
  <video controls autoplay loop>
        <source src="%s" type="video/mp4">
  </video>""" % data_url))

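# import Colab's output helper here: the name `output` was reused for the
# BigGAN output tensor in the generation loop above
from google.colab import files, output
output.eval_js('new Audio("https://freesound.org/data/previews/80/80921_1022651-lq.ogg").play()')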
files.download(moviefilename%fps)

# Explainer

There are three parts to this generative system:
1. BigGAN is our image generation network
2. CLIP is our text-image association network (it relates images to textual descriptions)
3. CMA-ES is our search/optimizer strategy.

### BigGAN

BigGAN (https://arxiv.org/abs/1809.11096) is a variety of Generative Adversarial Network (GAN) that set a standard for high-resolution, high-fidelity image synthesis in 2018. It used two to four times as many parameters and eight times the batch size of previous models, and synthesized state-of-the-art 512 x 512 pixel images across [1000 different classes](https://gist.githubusercontent.com/yrevar/942d3a0ac09ec9e5eb3a/raw/238f720ff059c1f82f368259d1ca4ffa5dd8f9f5/imagenet1000_clsidx_to_labels.txt) from [ImageNet](https://www.image-net.org/). It was also prohibitively expensive to train! Thankfully Google has released a number of pretrained models for us to explore.

BigGAN takes two inputs: a 128-dimensional "noise" vector (for the biggan-deep-512 model loaded above) and a 1000-dimensional one-hot "class" vector. The class vector selects which category of image the network tries to generate (or what mix of categories). The noise vector (latent vector) determines the appearance of a particular instance within that category (one "dog" out of all "dogs").
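
To make the two inputs concrete, here is a minimal standard-library sketch that builds them as plain Python lists. The helper names, the redraw-based truncation, and the class indices are illustrative assumptions; this notebook itself loads both vectors from the stored text files. The noise vector is kept very short here purely for display.

```python
import random

NUM_CLASSES = 1000  # one entry per ImageNet class


def one_hot(index, size=NUM_CLASSES):
    """Class vector that selects exactly one category."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def mix_classes(weights, size=NUM_CLASSES):
    """Class vector that blends categories; weights maps index -> share."""
    total = sum(weights.values())
    vec = [0.0] * size
    for index, share in weights.items():
        vec[index] = share / total
    return vec


def truncated_noise(dim, truncation=1.0, bound=2.0, seed=None):
    """Draw standard-normal values, redrawing any outside [-bound, bound],
    then scale every value by truncation."""
    rng = random.Random(seed)
    values = []
    while len(values) < dim:
        x = rng.gauss(0.0, 1.0)
        if abs(x) <= bound:
            values.append(truncation * x)
    return values


single = one_hot(207)                   # arbitrary class index
blend = mix_classes({207: 3, 281: 1})   # 75% / 25% mix of two classes
noise = truncated_noise(dim=8, seed=0)  # short vector, for display only

print(sum(single), blend[207], blend[281], len(noise))
```

Interpolating between two class vectors, as step 4 does, passes through blended vectors like `blend`, which is why the intermediate frames can look like hybrids of two categories.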

You can substitute a different generative network with CLIP to achieve a similar aim. For instance, people are experimenting with [StyleGAN2 ADA](https://colab.research.google.com/drive/1J8xyNRTNVnkNbQJnidcgSdDCHHKfGa8N?usp=sharing#scrollTo=I-YJmx89HLro), [DALL-E's encoder](https://colab.research.google.com/drive/1NGM9L8qP0gwl5z5GAuB_bd0wTNsxqclG), and [Lucent's FFT](https://colab.research.google.com/github/eps696/aphantasia/blob/master/Illustra.ipynb). Each generative approach will have different qualities, aesthetics, and representational range due to its training data and method. Reddit users have produced [a list of CLIP notebooks](https://www.reddit.com/r/MachineLearning/comments/ldc6oc/p_list_of_sitesprogramsprojects_that_use_openais/) from this hobbyist/experimenter community.

### CLIP
OpenAI's CLIP is the key part of this text-to-image translation.

[CLIP model card](https://github.com/openai/CLIP/blob/main/model-card.md)
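
CLIP maps an image and a piece of text into the same embedding space and scores how well they match by the cosine similarity of the two embeddings. The sketch below shows only that scoring step, with made-up 3-D vectors; real CLIP embeddings come from its neural encoders and are much longer. The function name and all numbers are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: one for a text prompt, two for candidate images.
text_embedding = [0.9, 0.1, 0.4]
image_a = [0.8, 0.2, 0.5]    # points in nearly the same direction as the text
image_b = [-0.3, 0.9, 0.1]   # points in a very different direction

score_a = cosine_similarity(text_embedding, image_a)
score_b = cosine_similarity(text_embedding, image_b)

# The image whose embedding is closer in direction to the text scores higher.
print(score_a > score_b)
```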

### CMA-ES

Covariance matrix adaptation evolution strategy (CMA-ES) is a strategy for numerical optimization.

This is our strategy for searching for the combinations of noise + class vector (BigGAN inputs) that produce the best representation of the prompt, according to CLIP (which knows how to relate images and textual descriptions). From a given starting point (random class, random noise), CMA-ES guides the changes in class and noise so that the output image from BigGAN better satisfies CLIP.

You can use this same GAN + CLIP architecture with a different optimizer to achieve a similar aim.
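
The shape of such a search loop can be sketched with a much simpler evolution strategy. The code below is not CMA-ES: it samples from an isotropic Gaussian whose step size shrinks on a fixed schedule, with no covariance adaptation, and it minimizes a toy distance-to-target function standing in for the CLIP score. All names and constants are illustrative assumptions.

```python
import random


def loss(candidate, target):
    """Toy objective: squared distance to a hidden target (lower is better)."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))


def simple_es(target, generations=60, population=20, elite=5,
              sigma=1.0, decay=0.9, seed=0):
    """Sample candidates around a mean, keep the best few, and repeat."""
    rng = random.Random(seed)
    mean = [0.0] * len(target)
    for _ in range(generations):
        # Propose candidates by adding Gaussian noise to the current mean.
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(population)]
        # Rank by the objective and keep the best `elite` candidates.
        candidates.sort(key=lambda c: loss(c, target))
        best = candidates[:elite]
        # Move the mean to the average of the elites, then narrow the search.
        mean = [sum(c[d] for c in best) / elite for d in range(len(mean))]
        sigma *= decay
    return mean


target = [1.0, -2.0, 0.5, 3.0, -1.0]
start_loss = loss([0.0] * len(target), target)
found = simple_es(target)
print("loss before:", start_loss, "loss after:", loss(found, target))
```

In the real system the candidate vector would hold the BigGAN noise and class inputs, and the objective would require generating an image and scoring it with CLIP, which is far more expensive than this toy function.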



# Activities
- Try experimenting with different prompts, but leave the other fields the same.
  - Change the prompt and select "Runtime->restart and run all" from the top menu. 
- Try textual prompts of different forms. Instead of "a photo of", try "a drawing of", "a picture of", or something else. You could also try "a drawing of X, a type of Y".
- To produce different results with the same prompt, try changing the seed. How do your results change?
- Save any results you like. Since we seeded the random value and ran it fresh each time ("Runtime->restart and run all") these results should be replicable.

# References

- Based on SIREN+CLIP Colabs by: [@advadnoun](https://twitter.com/advadnoun), [@norod78](https://twitter.com/norod78)

Using the works:
- https://github.com/openai/CLIP
- https://tfhub.dev/deepmind/biggan-deep-512
- https://github.com/huggingface/pytorch-pretrained-BigGAN
- http://www.aiartonline.com/design-2019/eyal-gruss (WanderGAN)
- Other CLIP notebooks: https://www.reddit.com/r/MachineLearning/comments/ldc6oc/p_list_of_sitesprogramsprojects_that_use_openais
- For a curated list of more online generative tools, see: [j.mp/generativetools](https://j.mp/generativetools)

Other CLIP notebooks:
- BigSLEEP (from [@advadnoun](https://twitter.com/advadnoun)): https://colab.research.google.com/drive/1NCceX2mbiKOSlAd_o7IU7nA9UskKN5WR?usp=sharing
