# Creating a multimodal pipeline with Gradio

In this homework, you will create a multimodal pipeline with the following components:

1. A speech-to-text (STT) model that converts audio to text
2. A large language model (LLM) that generates text from the transcribed input
3. A text-to-speech (TTS) model that converts the generated text to audio

Adapted from: https://www.gradio.app/guides/real-time-speech-recognition

In [1]:
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
%pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
%pip install torchaudio==2.0.2
%pip install -U ninja packaging accelerate bitsandbytes einops hf_transfer gradio pillow transformers
%pip install flash-attn --no-build-isolation
%pip install datasets soundfile

Looking in indexes: https://download.pytorch.org/whl/cu118


# Load libraries

In [2]:
from datasets import load_dataset
import gradio as gr
import numpy as np
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Speech to Text Model (20%)

It should take in speech from Gradio and output text.

Check out this guide to see how to complete it:

https://www.gradio.app/guides/real-time-speech-recognition

In [3]:
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base.en", device=device)

def speech2text(audio) -> str:
    """Transcribe Gradio's audio input, a (sample_rate, samples) tuple, to text."""
    #### YOUR CODE HERE ####
    sr, y = audio
    y = y.astype(np.float32)
    # Scale to [-1, 1]; skip silent clips to avoid dividing by zero
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak
    output = transcriber({"sampling_rate": sr, "raw": y})["text"]

    return output

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
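
The transcription pipeline expects a one-dimensional float array, so stereo recordings and silent clips are worth handling before the call. The sketch below is one possible preprocessing helper, not part of the original assignment: `prepare_audio` is a hypothetical name, and it assumes Gradio delivers stereo audio as an array of shape `(samples, channels)`.

```python
import numpy as np


def prepare_audio(audio):
    """Convert a (sample_rate, samples) tuple to mono float32 scaled to [-1, 1].

    Hypothetical helper; assumes stereo input has shape (samples, channels).
    """
    sr, y = audio
    y = np.asarray(y, dtype=np.float32)
    if y.ndim > 1:
        # Average the channels to get a single mono track
        y = y.mean(axis=1)
    # Peak-normalize, leaving silent or empty clips untouched
    peak = np.max(np.abs(y)) if y.size else 0.0
    if peak > 0:
        y = y / peak
    return sr, y
```

The returned pair could be passed on as `{"sampling_rate": sr, "raw": y}`, the same dictionary shape used in `speech2text`.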


# Load LLM (20%)

Check the docs to see how to load the model.
Define `call_llm` which takes a string as input and returns a string as output.

Docs: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

In [4]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")




Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": False,
}

llm_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device,
)

def call_llm(text: str) -> str:
    """This should call the LLM and output the generated text"""
    # Check the docs to see how to format the input
    #### YOUR CODE HERE ####
    messages = [{"role": "user", "content": text}]
    generate = llm_pipe(messages, **generation_args)
    output = generate[0]['generated_text']

    return output
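
The pipeline call above takes a list of chat messages rather than a raw string. As a minimal sketch, the hypothetical helper `build_messages` below assembles that list and rejects empty transcriptions. The optional system message is an assumption: whether the model honors it depends on the chat template shipped with its tokenizer.

```python
from typing import Dict, List, Optional


def build_messages(user_text: str, system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a chat-style message list for a text-generation pipeline.

    Hypothetical helper; the system role is optional and model-dependent.
    """
    user_text = user_text.strip()
    if not user_text:
        # An empty transcription gives the LLM nothing to respond to
        raise ValueError("user_text is empty")
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages
```

Inside `call_llm`, the result of `build_messages(text)` could stand in for the hand-written `messages` list.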

# Load Text to Speech model (20%)

Docs: https://huggingface.co/microsoft/speecht5_tts

In [6]:
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts", device=device)

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def text2speech(text) -> str:
    """This should call the TTS model and save the output to a file and return the file name."""
    file_name = "speech.wav"
    #### YOUR CODE HERE ####
    speech = synthesiser(text, forward_params={"speaker_embeddings": speaker_embedding})
    sf.write(file_name, speech["audio"], samplerate=speech["sampling_rate"])

    return file_name
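
Because `text2speech` always writes to `speech.wav`, each call overwrites the previous result. If that matters (for example, when several requests arrive at once), a unique file per call is one option. The sketch below uses only the standard library; `new_wav_path` is a hypothetical helper, not part of the assignment.

```python
import os
import tempfile


def new_wav_path(prefix: str = "tts_") -> str:
    """Create an empty .wav file in the system temp directory and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".wav")
    # Only the path is needed here; the audio writer reopens the file itself
    os.close(fd)
    return path
```

The returned path could be passed to `sf.write` in place of the fixed `file_name`. Files created this way are not deleted automatically.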



# Putting it all together (20%)

Complete the pipeline by feeding the output of the speech to text model to the LLM and then the output of the LLM to the text to speech model.

In [7]:
def demo(audio):
    """This should call the STT, LLM, and TTS and return the generated audio"""
    # Call the STT
    stt = speech2text(audio)

    # Call the LLM
    llm_output = call_llm(stt)

    # Call the TTS
    tts_output = text2speech(llm_output)

    # Return the transcribed text, generated text, and the file name of the generated audio
    return stt, llm_output, tts_output
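
Loading three models makes the full pipeline slow to test. One way to check the wiring first is to chain lightweight stand-ins for each stage. This is a sketch under that assumption: `make_pipeline` and the three lambda stubs are hypothetical and only mimic the input and output types of `speech2text`, `call_llm`, and `text2speech`.

```python
def make_pipeline(stt, llm, tts):
    """Chain three stage functions into a single audio -> (text, reply, audio) callable."""
    def run(audio):
        text = stt(audio)    # audio -> transcribed text
        reply = llm(text)    # transcribed text -> generated text
        speech = tts(reply)  # generated text -> audio file name
        return text, reply, speech
    return run


# Stand-in stages that avoid loading any model
run = make_pipeline(
    stt=lambda audio: "hello",
    llm=lambda text: text.upper(),
    tts=lambda text: f"{len(text)}.wav",
)
result = run((16000, [0, 0, 0]))
```

Swapping the stubs for the real functions gives the same three-element return value as `demo`.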

## Create the Gradio Interface (20%)

1. The input should accept audio from either a microphone or an uploaded file
2. The output is THREE elements:
    - The text generated by STT: the label should be "STT Output"
    - The text generated by the LLM: the label should be "LLM Output"
    - The audio generated by the TTS: the label should be "TTS Output"

**Make sure that it runs!**

Check out the Gradio documentation for help:

https://www.gradio.app/guides/quickstart

Here is sample audio that is known to work:
[My dog is cooler](https://drive.google.com/file/d/1JWvL-VRT_PIRtKIleQViHfxJnwpZmtEW/view?usp=sharing)

Your interface should look like this:

![gradio](https://i.imgur.com/na0GKvW.png)

In [14]:
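# Use a separate name so the `demo` function defined above is not overwritten when this cell is re-run
interface = gr.Interface(
    fn=demo,
    inputs="audio",
    outputs=[
        gr.Textbox(label="STT Output"),
        gr.Textbox(label="LLM Output"),
        gr.Audio(label="TTS Output"),
    ],
)

interface.launch(debug=True)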

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://c54963ca19feab653c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://c54963ca19feab653c.gradio.live


