Speech to image using [Whisper](https://github.com/openai/whisper) and [Stable Diffusion](https://github.com/CompVis/stable-diffusion) from the [Diffusers](https://github.com/huggingface/diffusers) library

---




### Install requirements

In [1]:
%%capture
!pip install --upgrade diffusers
!pip install --upgrade gradio
!pip install --upgrade ftfy
!pip install --upgrade accelerate
!pip install git+https://github.com/openai/whisper.git

In [2]:
import gradio as gr
import whisper
import os
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline
from google.colab import output
from huggingface_hub import notebook_login

In [3]:
output.enable_custom_widget_manager()

In [4]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-22475aeb-c069-1641-b865-7bf148d2f49b)


In [5]:
# log in to Hugging Face so the license-gated model weights can be downloaded
notebook_login()


In [6]:
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cuda'
LOW_VRAM = True  # load half-precision weights to fit smaller GPUs

if LOW_VRAM:
  # fp16 weights plus attention slicing lower peak GPU memory use
  pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
  )
  pipe.enable_attention_slicing()
else:
  pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)

# move the pipeline to the target device in both cases
pipe = pipe.to(device)

model = whisper.load_model("base").to(device) # or small, etc

def transcribe(audio, language=None):
    # load the recording and pad or trim it to Whisper's 30-second window
    audio = whisper.load_audio(audio)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    # detect the spoken language unless the user picked one
    if language is None or language == 'Autodetect':
        _, probs = model.detect_language(mel)
        language = max(probs, key=probs.get)
    # task='translate' makes Whisper output English text, which becomes the image prompt
    options = whisper.DecodingOptions(language=language, task='translate')
    prompt = whisper.decode(model, mel, options).text
    # generate an image from the English prompt
    with autocast(device):
        image = pipe(prompt).images[0]
    return f'Detected language: {language}', prompt, image

# block = gr.Blocks(css=".container { margin: auto; }")
demo = gr.Interface(
    fn=transcribe,
    inputs=[
        gr.Audio(source="microphone", type="filepath"),
        # let the user choose a language in case it was not detected correctly
        gr.Dropdown(["Autodetect"] + list(whisper.tokenizer.LANGUAGES.keys()), value="Autodetect"),
    ],
    outputs=["text", "text", "image"])

demo.launch(debug=True)

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```
.
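
The warning above appears even though the first cell installs `accelerate`, which suggests the package was not importable in the running kernel when the pipeline was loaded. A minimal sketch for checking this before loading the model, using only the standard library. The helper name `missing_packages` is hypothetical, and the idea that restarting the runtime after installation resolves the warning is an assumption, not something the output confirms.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level package names taken from the install and import cells of this notebook.
required = ["diffusers", "gradio", "ftfy", "accelerate", "whisper"]
missing = missing_packages(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If `accelerate` shows up as missing right after the install cell, rerunning the check after a runtime restart would show whether the kernel simply had not picked up the new package.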



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().

Using Embedded Colab Mode (NEW). If you have issues, please use share=True and file an issue at https://github.com/gradio-app/gradio/
Note: opening the browser inspector may crash Embedded Colab Mode.

To create a public link, set `share=True` in `launch()`.



  0%|          | 0/51 [00:00<?, ?it/s]




Potential NSFW content was detected in one or more images. A black image will be returned instead. Try again with a different prompt and/or seed.
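
The message above means the safety checker replaced the generated image with a black one. The Diffusers pipeline output carries an `nsfw_content_detected` list next to `images`, so the app could report the problem instead of showing a black image. A minimal sketch, assuming that output shape; the helper `first_safe_image` is hypothetical, and the `SimpleNamespace` objects are stand-ins for a real pipeline result.

```python
from types import SimpleNamespace

def first_safe_image(result):
    """Return the first image, or None if the safety checker flagged it."""
    flags = getattr(result, "nsfw_content_detected", None)
    if flags and flags[0]:
        return None
    return result.images[0]

# Stand-in objects that mimic the shape of the pipeline output, for illustration only.
flagged = SimpleNamespace(images=["black image"], nsfw_content_detected=[True])
clean = SimpleNamespace(images=["picture"], nsfw_content_detected=[False])
print(first_safe_image(flagged))
print(first_safe_image(clean))
```

Inside `transcribe`, the result of `pipe(prompt)` could be passed to such a helper, with the text output asking the user to try a different prompt whenever it returns `None`.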



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/gradio/routes.py", line 290, in run_predict
    fn_index, raw_input, username, session_state, iterators
  File "/usr/local/lib/python3.7/dist-packages/gradio/blocks.py", line 982, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "/usr/local/lib/python3.7/dist-packages/gradio/blocks.py", line 825, in call_function
    block_fn.fn, *processed_input, limiter=self.limiter
  File "/usr/local/lib/python3.7/dist-packages/anyio/to_thread.py", line 32, in run_sync
    func, *args, cancellable=cancellable, limiter=limiter
  File "/usr/local/lib/python3.7/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.7/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "<ipython-input-6-54a8d20c55e2>", line 20, in transcribe
    audio = whisper.load_
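
The traceback above is cut off at the `whisper.load_audio` call, so the actual error is not visible. One possible cause, and this is only an assumption, is that the interface was submitted without a recording, in which case `audio` would be `None`. A minimal sketch of a check that could run at the top of `transcribe`; the helper name `validate_audio_path` is hypothetical.

```python
import os
import tempfile

def validate_audio_path(audio):
    """Return an error message if `audio` is unusable, else None."""
    if audio is None:
        return "No recording received. Please record audio and submit again."
    if not os.path.isfile(audio):
        return f"Audio file not found: {audio}"
    return None

# Quick check with a temporary file standing in for a real recording.
with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
    print(validate_audio_path(None))
    print(validate_audio_path(tmp.name))
```

With this in place, `transcribe` could return the message as its first output, with an empty prompt and no image, rather than raising inside the Gradio worker thread.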

Keyboard interruption in main thread... closing server.


(<gradio.routes.App at 0x7f1094426690>, 'http://127.0.0.1:7860/', None)