<a href="https://colab.research.google.com/github/sahilaf/Bangla-Voice-Assistant/blob/main/Whisper_bangla.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q gradio transformers torch accelerate
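
The main cell below scales integer PCM samples to floats before handing them to the ASR pipeline. As a rough, dependency-free illustration of that arithmetic (not the notebook's own code), the sketch below uses plain Python lists in place of the NumPy arrays that Gradio supplies. The helper names `pcm_to_float` and `downmix_to_mono` are hypothetical and exist only for this example.

```python
# Illustrative only: plain lists stand in for the NumPy arrays Gradio returns.

def pcm_to_float(samples, bits=16):
    """Scale signed integer PCM samples into the range [-1.0, 1.0)."""
    full_scale = float(2 ** (bits - 1))  # 32768 for 16-bit, 2147483648 for 32-bit
    return [s / full_scale for s in samples]


def downmix_to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...]."""
    return [sum(frame) / len(frame) for frame in frames]


if __name__ == "__main__":
    # Hypothetical 3-frame stereo clip with 16-bit samples
    stereo = [(16384, -16384), (32767, 32767), (-32768, -32768)]
    mono = downmix_to_mono(stereo)
    print(pcm_to_float(mono))
```

The divisors match the constants used in `transcribe_audio` (32768.0 for `int16`, 2147483648.0 for `int32`). Dividing by the full-scale magnitude maps the most negative sample to exactly -1.0.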

In [None]:
# Run this in Google Colab
# Bangla Speech-to-Text API with Grammar Correction

import gradio as gr
import torch
from transformers import pipeline
import numpy as np

# Load the models
print("Loading Bangla Whisper model...")
device = "cuda" if torch.cuda.is_available() else "cpu"

# ASR - Bangla Speech-to-Text
asr = pipeline(
    "automatic-speech-recognition",
    model="asif00/whisper-bangla",
    device=device
)
print(f"ASR model loaded on {device}")

# Grammar correction model
print("Loading grammar correction model...")
ser = pipeline(
    "text2text-generation",
    model="asif00/mbart_bn_error_correction",
    device=device
)
print(f"Grammar correction model loaded on {device}")

def transcribe_audio(audio_data, apply_correction=True):
    """
    Transcribe audio to Bangla text with optional grammar correction

    Args:
        audio_data: tuple of (sample_rate, audio_array) or audio file path
        apply_correction: Whether to apply grammar correction (default: True)

    Returns:
        str: Transcribed (and optionally corrected) text
    """
    try:
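        # Gradio passes None when the request carries no audio
        if audio_data is None:
            return "Error: no audio received. Record or upload a clip first."

        # Handle different input formats
        if isinstance(audio_data, tuple):
            sample_rate, audio = audio_data
            # Convert integer PCM to float32 in [-1, 1)
            if audio.dtype == np.int16:
                audio = audio.astype(np.float32) / 32768.0
            elif audio.dtype == np.int32:
                audio = audio.astype(np.float32) / 2147483648.0
            else:
                audio = audio.astype(np.float32)
            # Down-mix (samples, channels) recordings to mono
            if audio.ndim > 1:
                audio = audio.mean(axis=1)
            # Pass the sample rate along so the pipeline can resample to the rate the model expects
            audio = {"sampling_rate": sample_rate, "raw": audio}
        else:
            # If it's a file path, the pipeline can handle it directly
            audio = audio_data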

        # Step 1: Transcribe audio
        print("Transcribing audio...")
        result = asr(audio)
        text = result["text"]
        print(f"Raw transcription: {text}")

        # Step 2: Apply grammar correction if enabled
        if apply_correction and text.strip():
            print("Applying grammar correction...")
            corrected = ser(text, max_length=512)
            corrected_text = corrected[0]["generated_text"]
            print(f"Corrected text: {corrected_text}")
            return corrected_text

        return text

    except Exception as e:
        error_msg = f"Error: {str(e)}"
        print(error_msg)
        return error_msg


# Create Gradio interface
gradio_interface = gr.Interface(
    fn=transcribe_audio,
    inputs=[
        gr.Audio(sources=["microphone", "upload"], type="numpy", label="Audio Input"),
        gr.Checkbox(value=True, label="Apply Grammar Correction")
    ],
    outputs=gr.Textbox(label="Transcription", lines=5),
    title="Bangla Speech-to-Text API with Grammar Correction",
    description="Upload audio or record to transcribe Bangla speech. Grammar correction is applied by default.",
    examples=None,
    api_name="transcribe"
)

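# Launch with share=True to get a public gradio.live URL.
# Note: with debug=True this call blocks in Colab, so the usage instructions
# printed below appear only after the server is stopped.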
app, local_url, share_url = gradio_interface.queue().launch(
    share=True,
    auth=("deepthinkers", "bangla2025"),
    show_api=True,
    debug=True  # Show errors in Colab
)

# Print API usage instructions
print("\n" + "="*60)
print("API ENDPOINT INFORMATION")
print("="*60)
print(f"\nBase URL: {share_url}")
print(f"\nCorrect API Endpoint: {share_url}/call/transcribe")
print("\n⚠️  IMPORTANT: Use /call/transcribe NOT /api/transcribe")
print("\nUsage with Python:")
print("-"*60)
print("""
from gradio_client import Client, handle_file

client = Client("{}", auth=("deepthinkers", "bangla2025"))

# With grammar correction (default)
result = client.predict(
    audio_data=handle_file("path/to/audio.wav"),
    apply_correction=True,
    api_name="/transcribe"
)
print(result)

# Without grammar correction
result = client.predict(
    audio_data=handle_file("path/to/audio.wav"),
    apply_correction=False,
    api_name="/transcribe"
)
print(result)
""".format(share_url))
print("="*60)

# Print how to install gradio_client and which models are loaded
print("\nTo install gradio_client:")
print("pip install gradio-client")
print("\nModels loaded:")
print("1. ASR: asif00/whisper-bangla")
print("2. Grammar Correction: asif00/mbart_bn_error_correction")

Loading Bangla Whisper model...


Device set to use cuda


ASR model loaded on cuda
Loading grammar correction model...


Device set to use cuda


Grammar correction model loaded on cuda
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://88a06ccbe8e9fdd60e.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Transcribing audio...
Error: We expect a numpy ndarray or torch tensor as input, got `<class 'NoneType'>`


`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.


Transcribing audio...
Raw transcription: প্রানাম কী?
Applying grammar correction...
Corrected text: প্রানাম কী ?
Transcribing audio...
Raw transcription: তোমার নামকী।
Applying grammar correction...
Corrected text: তোমার নামকী ।
Transcribing audio...
Raw transcription: চ্যাটজি পিটিকিভাবে কাঁচ করে।
Applying grammar correction...
Corrected text: চ্যাটজি পিটিকিভাবে কাঁচা করে ।
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://88a06ccbe8e9fdd60e.gradio.live

API ENDPOINT INFORMATION

Base URL: https://88a06ccbe8e9fdd60e.gradio.live

Correct API Endpoint: https://88a06ccbe8e9fdd60e.gradio.live/call/transcribe

⚠️  IMPORTANT: Use /call/transcribe NOT /api/transcribe

Usage with Python:
------------------------------------------------------------

from gradio_client import Client, handle_file

client = Client("https://88a06ccbe8e9fdd60e.gradio.live", auth=("deepthinkers", "bangla2025"))

# With grammar correction (default)
result = client.predict(