<a href="https://colab.research.google.com/github/sliscak/notebooks/blob/main/Advanced_Whisper%2BStable_Diffusion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

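Speech to image: spoken audio is transcribed and translated to English with [Whisper](https://github.com/openai/whisper), and the resulting text is used as the prompt for [Stable Diffusion](https://github.com/CompVis/stable-diffusion) from the [Diffusers](https://github.com/huggingface/diffusers) library.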

---




### Install requirements

In [1]:
%%capture
!pip install --upgrade diffusers
!pip install --upgrade gradio
!pip install --upgrade ftfy
!pip install --upgrade accelerate
!pip install git+https://github.com/openai/whisper.git

In [2]:
import gradio as gr
import whisper
import os
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline
from google.colab import output
from huggingface_hub import notebook_login

In [3]:
output.enable_custom_widget_manager()

In [4]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-8b68b2fc-2397-2c2d-e89d-ac31e5c783e3)


In [5]:
# Log in to Hugging Face so the license-gated Stable Diffusion weights can be downloaded
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


In [None]:
# The pipeline and autocast below assume a CUDA GPU (see the nvidia-smi output above).
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cuda'
# Set to True on GPUs with limited memory to load the half-precision (fp16) weights.
LOW_VRAM = False

if LOW_VRAM:
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        revision="fp16",
        torch_dtype=torch.float16,
        use_auth_token=True
    )
    # pipe.enable_attention_slicing()
else:
    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
# Move the pipeline to the GPU in both cases.
pipe = pipe.to(device)

model = whisper.load_model("base").to(device)  # or "small", etc.

def transcribe(audio, language=None):
    # Load the recording and pad or trim it to the 30-second window Whisper expects.
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    # Detect the spoken language unless the user picked one explicitly.
    if language is None or language == 'Autodetect':
        _, probs = model.detect_language(mel)
        language = max(probs, key=probs.get)
    # task='translate' makes Whisper output English text, which becomes the image prompt.
    options = whisper.DecodingOptions(language=language, task='translate')
    prompt = whisper.decode(model, mel, options).text
    # print(prompt)
    with autocast(device):
        image = pipe(prompt).images[0]
    return f'Detected language: {language}', prompt, image

# block = gr.Blocks(css=".container { margin: auto; }")
demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(source="microphone", type="filepath"),
        # Let the user choose a language in case it was not detected correctly.
        gr.Dropdown(["Autodetect"] + list(whisper.tokenizer.LANGUAGES.keys()), value="Autodetect"),
    ],
    outputs=["text", "text", "image"],
)

demo.launch(debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().

Using Embedded Colab Mode (NEW). If you have issues, please use share=True and file an issue at https://github.com/gradio-app/gradio/
Note: opening the browser inspector may crash Embedded Colab Mode.

To create a public link, set `share=True` in `launch()`.
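
The language handling inside `transcribe` can be isolated as a small pure-Python sketch. It assumes `probs` is a dictionary mapping language codes to probabilities, which is how the cell above uses the result of `model.detect_language`. The helper name `resolve_language` and the sample probabilities are hypothetical and only illustrate the selection rule.

```python
def resolve_language(choice, probs):
    """Return the user's choice, or the most probable detected language.

    `probs` is assumed to map language codes to probabilities.
    """
    if choice is None or choice == "Autodetect":
        # Same rule as in transcribe(): pick the key with the highest probability.
        return max(probs, key=probs.get)
    return choice


# Made-up probabilities, for illustration only.
sample_probs = {"en": 0.12, "de": 0.81, "fr": 0.07}

print(resolve_language("Autodetect", sample_probs))  # de
print(resolve_language("fr", sample_probs))          # fr
```

An explicit dropdown choice always wins over detection, which is why the dropdown can be used to correct a wrong guess.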


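
The dropdown entries are built from the keys of `whisper.tokenizer.LANGUAGES`, which maps language codes to language names. The following sketch shows that construction on a three-entry excerpt so it runs without Whisper installed. `LANGUAGES_EXCERPT`, `dropdown_choices`, and `describe` are hypothetical names introduced for illustration.

```python
# Small excerpt in the shape of whisper.tokenizer.LANGUAGES (code -> name).
LANGUAGES_EXCERPT = {"en": "english", "de": "german", "fr": "french"}


def dropdown_choices(languages):
    """Build the dropdown entries: "Autodetect" followed by the language codes."""
    return ["Autodetect"] + list(languages.keys())


def describe(choice, languages):
    """Return a readable name for a dropdown entry (hypothetical helper)."""
    if choice == "Autodetect":
        return "autodetect"
    # Fall back to the raw entry if the code is not in the mapping.
    return languages.get(choice, choice)


choices = dropdown_choices(LANGUAGES_EXCERPT)
print(choices)
print(describe("de", LANGUAGES_EXCERPT))
```

Passing the codes rather than the names keeps the selected value directly usable as the `language` argument of `whisper.DecodingOptions`.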