<a href="https://colab.research.google.com/github/shricastic/WhisKo-AI/blob/master/WhisKo_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

In [5]:
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

In [6]:
# Load Whisper model
processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [7]:
# Load audio file
audio_input = "Rev.mp3"
waveform, sample_rate = torchaudio.load(audio_input)

print(sample_rate)

48000


In [8]:
# Resample to 16 kHz, the sampling rate the Whisper feature extractor expects
if sample_rate != 16000:
    transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = transform(waveform)
    sample_rate = 16000
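
Resampling changes the number of samples, not the duration of the clip. A rough sketch of the arithmetic, using only the standard library; the helper name `resampled_length` is hypothetical, and rounding up is an assumption about how the resampler sizes its output:

```python
import math

def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    """Approximate number of samples after resampling (rounded up)."""
    return math.ceil(num_samples * new_freq / orig_freq)

# A 5-second clip at 48 kHz, resampled to 16 kHz
n_in = 5 * 48000
n_out = resampled_length(n_in, 48000, 16000)
print(n_in, n_out, n_out / 16000)  # 240000 80000 5.0
```

Going from 48000 Hz to 16000 Hz keeps one third of the samples, so the clip still lasts 5 seconds.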

In [10]:
# Preprocess audio: downmix to mono (this also handles stereo files), then extract log-mel features
inputs = processor(waveform.mean(dim=0).numpy(), sampling_rate=sample_rate, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}

print(inputs)

{'input_features': tensor([[[-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748],
         [-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748],
         [-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748],
         ...,
         [-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748],
         [-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748],
         [-0.6748, -0.6748, -0.6748,  ..., -0.6748, -0.6748, -0.6748]]],
       device='cuda:0')}
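
The `input_features` tensor is a log-mel spectrogram. The sketch below checks how its size relates to the encoder layers printed earlier. It assumes Whisper's usual front end (16 kHz audio padded or trimmed to 30 seconds, with a hop of 160 samples); these constants are assumptions written out by hand, not values read from the processor.

```python
SAMPLE_RATE = 16000   # Hz (assumed)
CHUNK_SECONDS = 30    # window length Whisper pads or trims to (assumed)
HOP_LENGTH = 160      # samples between spectrogram frames (assumed)
N_MELS = 80           # matches the 80 input channels of conv1 in the printed model
CONV2_STRIDE = 2      # stride of conv2 in the printed model

n_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
encoder_positions = n_frames // CONV2_STRIDE

print((1, N_MELS, n_frames))  # (1, 80, 3000)
print(encoder_positions)      # 1500
```

Under these assumptions the stride-2 convolution halves 3000 frames to 1500, which agrees with `Embedding(1500, 768)` in the encoder.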


In [11]:
# Perform STT (recent transformers versions take language/task directly instead of forced_decoder_ids)
with torch.no_grad():
    generated_ids = model.generate(**inputs, language="en", task="transcribe")

print(generated_ids)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


tensor([[ 1911,    11,   393,   291,   976,   385,   257,  5353,   466,  1337,
          1166,  7318,   293,   577, 25242,  2316,  1985,   293,   437,   307,
           264,  4088,  9482,   926,   309,   293,  1338,    11,  1310,   577,
          4825,    12,  8014,   338,  7318,  1985,    13]], device='cuda:0')


In [12]:
# Decode transcription
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)

Transcription:  Hey, can you give me a brief about generative AI and how diffusion model works and what is the transform architecture around it and yeah, maybe how multi-model AI works.
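
The recording used here is short, but Whisper works on windows of about 30 seconds, so a longer recording would need to be split before transcription. A minimal sketch of that split follows, assuming a 1-D sequence of samples at 16 kHz. The helper `split_into_windows` is hypothetical and uses plain slicing, with no overlap or silence detection.

```python
def split_into_windows(samples, sample_rate=16000, window_seconds=30):
    """Cut a 1-D sequence of samples into consecutive fixed-length windows."""
    window = sample_rate * window_seconds
    return [samples[start:start + window] for start in range(0, len(samples), window)]

# 70 seconds of silence as stand-in data
samples = [0.0] * (16000 * 70)
chunks = split_into_windows(samples)
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each window could then go through the preprocessing, generation, and decoding cells in turn, with the decoded pieces joined afterwards.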


In [13]:
# Load the LLM API key from Colab secrets
from google.colab import userdata
LLM_API_KEY = userdata.get('GEMINI_API_KEY')
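
`google.colab.userdata` only exists inside Colab. To run the notebook elsewhere, one option is to fall back to an environment variable. The helper `load_api_key` below is a hypothetical sketch of that idea, not part of any library.

```python
import os

def load_api_key(name: str = "GEMINI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)  # None if the variable is not set
    return userdata.get(name)

api_key = load_api_key()
print("key found" if api_key else "key missing")
```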

In [14]:
# Set up the LLM client (Gemini through its OpenAI-compatible endpoint)
from openai import OpenAI

llm = OpenAI(
    api_key=LLM_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [15]:
prompt = f"""I will provide a question from a user. Reply to it in about 200 words as a helpful assistant.
Question: {transcription.strip()}
"""

response = llm.chat.completions.create(
    model="gemini-1.5-flash",
    n=1,
    messages=[
        {"role": "system", "content": "You are a helpful customer representative and assistant."},
        {
            "role": "user",
            "content": prompt,
        }
    ]
)

print(response)

ChatCompletion(id=None, choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hey there!  Let\'s break down generative AI, diffusion models, transformers, and multimodal AI.\n\nGenerative AI refers to AI models capable of creating new content, like images, text, or music, rather than just analyzing existing data.  They learn patterns from training data and then generate novel outputs that share similar characteristics.\n\nDiffusion models are a type of generative AI that work by gradually adding noise to an image (or other data) until it becomes pure noise, then learning to reverse this process.  They learn to denoise the image step-by-step, eventually reconstructing the original or generating a new, similar image.  Think of it like slowly blurring a photo until it\'s completely unrecognizable, then learning to sharpen it back to a clear image, possibly even a different but similar one.\n\nTransformer architecture is crucial to many genera
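
Building the prompt in a small function keeps indentation out of the text sent to the model and makes the word limit easy to change. The sketch below uses only the standard library; `build_messages`, `PROMPT_TEMPLATE`, and `SYSTEM_PROMPT` are hypothetical names introduced for illustration.

```python
from textwrap import dedent

SYSTEM_PROMPT = "You are a helpful customer representative and assistant."

# Dedent the template first, then fill it in, so the question text cannot affect the dedent
PROMPT_TEMPLATE = dedent("""\
    I will provide a question from a user. Reply to it in about {word_limit} words as a helpful assistant.
    Question: {question}
""")

def build_messages(question: str, word_limit: int = 200) -> list:
    """Return a chat-completions style message list for one user question."""
    prompt = PROMPT_TEMPLATE.format(word_limit=word_limit, question=question.strip())
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

for message in build_messages("  How do diffusion models work? "):
    print(message["role"], "->", message["content"])
```

The returned list has the same shape as the `messages` argument passed to `llm.chat.completions.create` above.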

In [16]:
llm_response = response.choices[0].message.content
print(llm_response)

Hey there!  Let's break down generative AI, diffusion models, transformers, and multimodal AI.

Generative AI refers to AI models capable of creating new content, like images, text, or music, rather than just analyzing existing data.  They learn patterns from training data and then generate novel outputs that share similar characteristics.

Diffusion models are a type of generative AI that work by gradually adding noise to an image (or other data) until it becomes pure noise, then learning to reverse this process.  They learn to denoise the image step-by-step, eventually reconstructing the original or generating a new, similar image.  Think of it like slowly blurring a photo until it's completely unrecognizable, then learning to sharpen it back to a clear image, possibly even a different but similar one.

Transformer architecture is crucial to many generative AI models, including some diffusion models.  Transformers excel at processing sequential data (like text or time series) by atte

In [17]:
# Install Kokoro for TTS (quote the requirement so the shell does not treat ">=" as a redirect)
!pip install -q "kokoro>=0.3.4" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

In [18]:
# Imports for the TTS pipeline, audio playback, and saving WAV files
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf

In [25]:
pipeline = KPipeline(lang_code='a')  # 'a' selects American English

generator = pipeline(
    llm_response, voice='af_heart',
    speed=1, split_pattern=r''
)

print(generator)

<generator object KPipeline.__call__ at 0x79e498759e80>
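
`split_pattern` controls how the text is cut into segments before synthesis. The sketch below does not call Kokoro. It only illustrates, with the standard `re` module, what splitting a reply on line breaks with a pattern such as `r'\n+'` would produce. Treating an empty pattern as "do not pre-split" is an assumption of this sketch, and `split_text` is a hypothetical helper.

```python
import re

def split_text(text, split_pattern=r"\n+"):
    """Split text into non-empty segments on the given regex pattern."""
    if not split_pattern:
        # Assumption: an empty pattern leaves the text as a single segment
        return [text.strip()]
    parts = re.split(split_pattern, text.strip())
    return [part.strip() for part in parts if part.strip()]

reply = "Hey there!\n\nGenerative AI creates new content.\n\nDiffusion models reverse noise."
for i, segment in enumerate(split_text(reply)):
    print(i, segment)
```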


In [26]:
# Each item is (graphemes, phonemes, audio); the audio is played and saved at 24 kHz
for i, (gs, ps, audio) in enumerate(generator):
    print(i)
    print(gs)
    print(ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)

0
Hey there!  Let's break down generative AI, diffusion models, transformers, and multimodal AI.

Generative AI refers to AI models capable of creating new content, like images, text, or music, rather than just analyzing existing data.  They learn patterns from training data and then generate novel outputs that share similar characteristics.
hˈA ðˈɛɹ! lˈɛts bɹˈAk dˌWn ʤˈɛnəɹəTɪv ˈAˌI, dəfjˈuʒən mˈɑdᵊlz, tɹænsfˈɔɹməɹz, ænd mˌʌltimˈOdᵊl ˈAˌI.ʤˈɛnəɹɹətˌɪv ˈAˌI ɹəfˈɜɹz tʊ ˈAˌI mˈɑdᵊlz kˈApəbᵊl ʌv kɹiˈATɪŋ nˈu kˈɑntɛnt, lˈIk ˈɪmɪʤᵻz, tˈɛkst, ɔɹ mjˈuzɪk, ɹˈæðəɹ ðən ʤˈʌst ˈænəlˌIzɪŋ ɪɡzˈɪstɪŋ dˈATə. ðA lˈɜɹn pˈæTəɹnz fɹʌm tɹˈAnɪŋ dˈATə ænd ðˈɛn ʤˈɛnəɹˌAt nˈɑvᵊl ˈWtpˌʊts ðæt ʃˈɛɹ sˈɪmᵊləɹ kˌɛɹəktəɹˈɪstɪks.


1
Diffusion models are a type of generative AI that work by gradually adding noise to an image (or other data) until it becomes pure noise, then learning to reverse this process.  They learn to denoise the image step-by-step, eventually reconstructing the original or generating a new, similar image.  Think of it like slowly blurring a photo until it's completely unrecognizable, then learning to sharpen it back to a clear image, possibly even a different but similar one.
dᵻfjˈuʒən mˈɑdᵊlz ɑɹ ɐ tˈIp ʌv ʤˈɛnəɹəTɪv ˈAˌI ðæt wˈɜɹk bI ɡɹˈæʤəwəli ˈædɪŋ nˈYz tə ɐn ˈɪmɪʤ (ɔɹ ˈʌðəɹ dˈATə) ˌʌntˈɪl ɪt bəkˈʌmz pjˈʊɹ nˈYz, ðˈɛn lˈɜɹnɪŋ tə ɹəvˈɜɹs ðɪs pɹˈɑsˌɛs. ðA lˈɜɹn tə dᵻnˈYs ði ˈɪmɪʤ stˈɛpbIstˈɛp, əvˈɛnʧəli ɹˌikənstɹˈʌktɪŋ ði əɹˈɪʤənᵊl ɔɹ ʤˈɛnəɹˌATɪŋ ɐ nˈu, sˈɪmᵊləɹ ˈɪmɪʤ. θˈɪŋk ʌv ɪt lˈIk slˈOli blˈɜɹɪŋ ɐ fˈOTO ˌʌntˈɪl ɪts kəmplˈitli ˌʌnɹˌɛkəɡnˈIzəbᵊl, ðˈɛn lˈɜɹnɪŋ tə ʃˈɑɹpən ɪt bˈæk tə ɐ klˈɪɹ ˈɪmɪʤ, pˈɑsəbli ˈivən ɐ dˈɪfəɹənt bˌʌt sˈɪmᵊləɹ wˈʌn.


2
Transformer architecture is crucial to many generative AI models, including some diffusion models.  Transformers excel at processing sequential data (like text or time series) by attending to different parts of the input simultaneously. This "attention mechanism" allows the model to understand relationships between words or elements across longer sequences, crucial for generating coherent and contextually relevant outputs.
tɹænsfˈɔɹməɹ ˈɑɹkətˌɛkʧəɹ ɪz kɹˈuʃᵊl tə mˈɛni ʤˈɛnəɹəTɪv ˈAˌI mˈɑdᵊlz, ɪnklˈudɪŋ sˌʌm dəfjˈuʒən mˈɑdᵊlz. tɹænsfˈɔɹməɹz ɪksˈɛl æt pɹˈɑsɛsɪŋ səkwˈɛnʧᵊl dˈATə (lˈIk tˈɛkst ɔɹ tˈIm sˈɪɹiz) bI ətˈɛndɪŋ tə dˈɪfəɹənt pˈɑɹts ʌv ði ˈɪnpˌʊt sˌIməltˈAniəsli. ðˌɪs “ətˈɛnʧən mˈɛkənˌɪzəm” əlˈWz ðə mˈɑdᵊl tʊ ˌʌndəɹstˈænd ɹəlˈAʃənʃˌɪps bətwˈin wˈɜɹdz ɔɹ ˈɛləmənts əkɹˈɔs lˈɔŋɡəɹ sˈikwənsᵻz, kɹˈuʃᵊl fɔɹ ʤˈɛnəɹˌATɪŋ kOhˈɪɹənt ænd kəntˈɛksʧəwəli ɹˈɛləvᵊnt ˈWtpˌʊts.


3
They form the backbone of models like GPT-3 and others used in image generation.

Multimodal AI refers to systems that can process and understand information from multiple modalities, such as text, images, audio, and video.  Instead of working with just one type of data, they integrate information from various sources to generate more comprehensive and nuanced outputs.
ðA fˈɔɹm ðə bˈækbˌOn ʌv mˈɑdᵊlz lˈIk ʤˌipˌitˈi θɹˈi ænd ˈʌðəɹz jˈuzd ɪn ˈɪmɪʤ ʤˌɛnəɹˈAʃən.mˌʌltɪmˈOdᵊl ˈAˌI ɹəfˈɜɹz tə sˈɪstəmz ðæt kæn pɹˈɑsˌɛs ænd ˌʌndəɹstˈænd ˌɪnfəɹmˈAʃən fɹʌm mˈʌltəpᵊl mOdˈæləTiz, sˈʌʧ æz tˈɛkst, ˈɪmɪʤᵻz, ˈɔdiO, ænd vˈɪdiO. ɪnstˈɛd ʌv wˈɜɹkɪŋ wɪð ʤˈʌst wˈʌn tˈIp ʌv dˈATə, ðA ˈɪntəɡɹˌAt ˌɪnfəɹmˈAʃən fɹʌm vˈɛɹiəs sˈɔɹsᵻz tə ʤˈɛnəɹˌAt mˈɔɹ kˌɑmpɹəhˈɛnsɪv ænd nˈuˌɑnst ˈWtpˌʊts.


4
For example, a multimodal AI could generate a caption for an image, describe a scene in a video, or even create a story based on a combination of text and image inputs.  The underlying models often combine different architectures to handle each modality effectively.
fˌɔɹ ɪɡzˈæmpəl, ɐ mˌʌltimˈOdᵊl ˈAˌI kʊd ʤˈɛnəɹˌAt ɐ kˈæpʃən fɔɹ ɐn ˈɪmɪʤ, dəskɹˈIb ɐ sˈin ɪn ɐ vˈɪdiO, ɔɹ ˈivən kɹiˈAt ɐ stˈɔɹi bˈAst ˌɔn ɐ kˌɑmbənˈAʃən ʌv tˈɛkst ænd ˈɪmɪʤ ˈɪnpˌʊts. ðə ˌʌndəɹlˈIɪŋ mˈɑdᵊlz ˈɔfᵊn kəmbˈIn dˈɪfəɹənt ˈɑɹkətˌɛkʧəɹz tə hˈændəl ˈiʧ mOdˈæləTi əfˈɛktəvli.
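
The loop above writes one WAV file per segment (`0.wav` to `4.wav`). To get a single file for the whole reply, the segments can be concatenated. The sketch below uses the standard `wave` module and assumes the files are plain PCM WAVs with identical channel count, sample width, and sample rate; `concat_wavs` is a hypothetical helper.

```python
import wave

def concat_wavs(paths, out_path):
    """Concatenate PCM WAV files that share the same format into one file."""
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as src:
            current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in frames:
            dst.writeframes(chunk)

# Usage in the notebook, for the five segments written above:
# concat_wavs([f"{i}.wav" for i in range(5)], "response.wav")
```

Checking the format of every input first avoids silently producing a file that plays at the wrong speed.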
