<a href="https://colab.research.google.com/github/work4ai/AI_Voice_Assistant/blob/main/AI_Voice_Assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q transformers gradio gtts torchvision torchaudio
!pip install -q git+https://github.com/openai/whisper.git


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
import torch
import whisper
from transformers import BlipProcessor, BlipForQuestionAnswering
from PIL import Image
from gtts import gTTS
import gradio as gr
import os

# Load Whisper
whisper_model = whisper.load_model("base")

# Load BLIP
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
blip_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [3]:
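def transcribe_audio(audio_path):
    """Transcribe a recorded question with Whisper; return "" if nothing was recorded."""
    if audio_path is None:
        return ""
    audio = whisper.load_audio(audio_path)
    # pad_or_trim fits the clip to Whisper's 30-second window, so longer audio is cut off.
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(whisper_model.device)
    # fp16=False decodes in float32, which works on both CPU and GPU.
    options = whisper.DecodingOptions(fp16=False)
    result = whisper.decode(whisper_model, mel, options)
    return result.text.strip()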

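def analyze_image_vqa(image_path, question):
    """Answer a question about an image with the BLIP VQA model."""
    image = Image.open(image_path).convert("RGB")
    # Fall back to a generic prompt when the question is missing or blank.
    prompt = question if question and question.strip() else "Describe the image in detail."

    # The processor prepares both the image and the question as model inputs.
    inputs = blip_processor(image, prompt, return_tensors="pt")
    out = blip_model.generate(**inputs)
    return blip_processor.decode(out[0], skip_special_tokens=True)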

def speak_text(text, filename="response.mp3"):
    # Synthesize English speech with gTTS; the default filename is overwritten on every call.
    tts = gTTS(text=text, lang='en')
    tts.save(filename)
    return filename
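
Because `speak_text` writes to a fixed default filename, two requests arriving close together (for example through the public share link) could overwrite each other's audio. The sketch below shows one possible way to hand each request its own file using only the standard library. The helper name `unique_mp3_path` is hypothetical and not part of the original notebook.

```python
import os
import tempfile


def unique_mp3_path(directory=None):
    """Return the path of a new, empty .mp3 file reserved for one request."""
    # delete=False keeps the file on disk after the handle is closed,
    # so a later writer (for example gTTS's save method) can fill it in.
    with tempfile.NamedTemporaryFile(suffix=".mp3", dir=directory, delete=False) as handle:
        return handle.name


first = unique_mp3_path()
second = unique_mp3_path()
print(first != second)  # each call reserves a different file

# Clean up the demo files.
os.remove(first)
os.remove(second)
```

Under that assumption, calling `speak_text(answer, filename=unique_mp3_path())` would write each response to a separate file. Old files would still need to be cleaned up at some point.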

In [None]:
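def process(audio, image):
    # Transcribe the spoken question; fall back to a generic prompt if none was recorded.
    question = transcribe_audio(audio)
    if not question:
        question = "Describe this image in detail."

    # Without an image there is nothing to analyze, so skip BLIP and text-to-speech.
    if image is None:
        return question, "Please upload an image.", None

    answer = analyze_image_vqa(image, question)
    audio_file = speak_text(answer)
    return question, answer, audio_file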

iface = gr.Interface(
    fn=process,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath", label="Ask a Question"),
        gr.Image(type="filepath", label="Upload Image")
    ],
    outputs=[
        gr.Textbox(label="Transcribed Question"),
        gr.Textbox(label="Image Description"),
        gr.Audio(label="Spoken Answer")
    ],
    title="🖼️ Voice-Powered Image Description",
    description="Speak a question and upload an image. The app transcribes your voice, analyzes the image, and reads the answer aloud.",
)

iface.launch(debug=True)


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://dd387a95e8c80feb53.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
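
The app chains three steps: transcribe the question, answer it against the image, and speak the answer. A minimal sketch of that flow follows, with the three steps passed in as callables so the fallback-question logic can be checked without loading Whisper or BLIP. The names `run_pipeline`, `fake_transcribe`, `fake_answer`, and `fake_speak` are hypothetical stand-ins, not part of the notebook.

```python
def run_pipeline(audio, image, transcribe, answer, speak,
                 default_question="Describe this image in detail."):
    """Chain transcription, visual question answering, and speech synthesis."""
    # An empty transcript (no recording, or silence) falls back to a generic prompt.
    question = transcribe(audio) or default_question
    reply = answer(image, question)
    return question, reply, speak(reply)


# Stand-ins for the real models, used only to exercise the control flow.
def fake_transcribe(audio):
    return "" if audio is None else "what color is the car?"


def fake_answer(image, question):
    return f"answer to: {question}"


def fake_speak(text):
    return "response.mp3"


# No recording: the default question is used.
print(run_pipeline(None, "car.jpg", fake_transcribe, fake_answer, fake_speak))
# With a recording: the transcribed question is passed through.
print(run_pipeline("q.wav", "car.jpg", fake_transcribe, fake_answer, fake_speak))
```

Swapping the stand-ins for `transcribe_audio`, `analyze_image_vqa`, and `speak_text` would give the same flow as the notebook's `process` function.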
