<a href="https://colab.research.google.com/github/sahilaf/Test_models/blob/main/bangla_local_assistant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ===========================================
# 🔊 Bangla Voice Assistant with Gemma 3 + Edge-TTS
# ===========================================

# --- Authenticate with Hugging Face, then install dependencies ---

In [11]:
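from huggingface_hub import login
# Replace "Access_token" with your own Hugging Face token; never commit a real token.
# google/gemma-3-4b-it is a gated model, so the account must have accepted its license.
login("Access_token")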
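To keep the token out of the notebook itself, one option is to read it from an environment variable. The sketch below is a minimal illustration, not part of the original notebook: `read_hf_token` is a hypothetical helper, and it assumes the token has been exported as `HF_TOKEN` before the cell runs.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is missing or blank."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Usage inside the notebook (requires huggingface_hub to be installed):
# from huggingface_hub import login
# login(read_hf_token())
```

Failing loudly on a missing token makes the problem visible at login time rather than later, when the gated model download is refused.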



In [8]:
!pip install edge-tts gradio transformers torch torchaudio spaces git+https://github.com/openai/whisper.git --quiet

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [20]:
!pip install nest_asyncio



In [21]:
import nest_asyncio, asyncio, time, edge_tts, torch, IPython.display as ipd
from transformers import pipeline
from google.colab import files

In [22]:
nest_asyncio.apply()
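`nest_asyncio.apply()` is needed here because Colab already runs an event loop, and a plain `run_until_complete()` on a running loop raises an error. As an alternative that avoids patching the loop, the coroutine can be run on a fresh loop in a worker thread. This is a sketch of that swapped-in technique, not the notebook's method; `run_in_thread` is a hypothetical helper name.

```python
import asyncio
import threading


def run_in_thread(coro):
    """Run `coro` to completion on its own event loop in a worker thread."""
    outcome = {}

    def worker():
        try:
            # asyncio.run creates and closes a new loop owned by this thread.
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # block until the coroutine has finished

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Because the coroutine runs in a separate thread, this only suits coroutines that do not touch objects bound to the notebook's own loop.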

In [12]:
# -----------------------------
# 1. Load Models
# -----------------------------
asr = pipeline(model="asif00/whisper-bangla")  # Bangla Speech-to-Text
ser = pipeline("text2text-generation", model="asif00/mbart_bn_error_correction")  # Grammar correction

gemma_pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

Device set to use cuda:0
Device set to use cuda:0


`torch_dtype` is deprecated! Use `dtype` instead!


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Device set to use cuda
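The model cell requests `torch.bfloat16` whether or not a GPU is present. A possible variation, shown here only as a sketch, is to fall back to `float32` on CPU, where reduced-precision inference may be slow. `pick_device_and_dtype` is a hypothetical helper; it returns the dtype as a string so the example runs without `torch` installed.

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return (device, dtype_name): bfloat16 on GPU, float32 on CPU."""
    if cuda_available:
        return "cuda", "bfloat16"
    return "cpu", "float32"


# Usage inside the notebook (assumes torch and transformers are imported):
# device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
# gemma_pipe = pipeline(
#     "image-text-to-text",
#     model="google/gemma-3-4b-it",
#     torch_dtype=getattr(torch, dtype_name),
#     device=device,
# )
```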


In [25]:
# -----------------------------
# 2. Async Bangla TTS
# -----------------------------
async def bangla_tts(text, voice="bn-BD-PradeepNeural", rate=0, pitch=0):
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    output_path = f"/content/bangla_reply_{timestamp}.mp3"
    communicate = edge_tts.Communicate(
        text, voice, rate=f"{rate:+d}%", pitch=f"{pitch:+d}Hz"
    )
    await communicate.save(output_path)
    return output_path

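def run_async(coro):
    """Run a coroutine to completion from synchronous notebook code.

    Colab already has an event loop running, so this relies on the earlier
    nest_asyncio.apply() call to allow a nested run_until_complete().
    """
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(coro)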

# -----------------------------
# 3. Voice→Text→LLM→Voice Pipeline
# -----------------------------
def full_conversation(audio_path):
    print("🎙 Transcribing...")
    text = asr(audio_path)["text"]
    print(f"📝 Raw transcription: {text}")

    corrected_text = ser(text)[0]["generated_text"]
    print(f"✅ Corrected text: {corrected_text}")

    print("🤖 Generating AI reply...")
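    # English gloss of the Bangla system prompt below: "You are a friendly, concise
    # assistant who speaks only Bangla. Never use English words, digits, or symbols.
    # Keep your answer within two sentences. Speak in plain Bangla so that it sounds
    # like a person talking."
    messages = [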
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "তুমি একজন বন্ধুত্বপূর্ণ, সংক্ষিপ্ত এবং সম্পূর্ণ বাংলা ভাষাভাষী সহকারী। "
                        "কখনো কোনো ইংরেজি শব্দ, সংখ্যা, বা প্রতীক ব্যবহার করবে না। "
                        "তোমার উত্তর সর্বোচ্চ দুই বাক্যের মধ্যে সীমাবদ্ধ রাখবে। "
                        "শুধু সাধারণ বাংলায় কথা বলবে যেন মনে হয় একজন মানুষ কথা বলছে।"
                    )
                }
            ],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": corrected_text}],
        },
    ]

    output = gemma_pipe(text=messages, max_new_tokens=120)
    reply = output[0]["generated_text"][-1]["content"]

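    # Strip ASCII letters, digits, and common symbols so the TTS voice reads only Bangla
    import re
    reply = re.sub(r"[A-Za-z0-9@#\$%\^\&\*\(\)\[\]\{\};:!~`_+=<>/\\|]", "", reply)
    words = reply.split()  # Splitting and re-joining collapses repeated whitespace
    if len(words) > 50:
        # Hard cap at 50 words; close the truncated text with a Bangla full stop (danda)
        reply = " ".join(words[:50]) + "।"
    else:
        reply = " ".join(words)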

    # The Bangla label below reads "Assistant's short reply"
    print(f"\n💬 সহায়কের সংক্ষিপ্ত উত্তর: {reply}")

    print("🔊 Generating speech...")
    tts_path = run_async(bangla_tts(reply))
    print(f"✅ Saved: {tts_path}")

    return reply, ipd.Audio(tts_path)
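The reply clean-up inside `full_conversation` can be pulled out into a small function so it is testable without loading any model. The sketch below reuses the notebook's character class and 50-word cap; `clean_reply` and its `max_words` parameter are hypothetical names introduced for illustration. Note that characters outside the class, such as `-`, `?`, and `,`, pass through unchanged.

```python
import re

# Same character class as the notebook: ASCII letters, digits, and a set of symbols.
_STRIP = re.compile(r"[A-Za-z0-9@#\$%\^\&\*\(\)\[\]\{\};:!~`_+=<>/\\|]")


def clean_reply(reply: str, max_words: int = 50) -> str:
    """Drop ASCII letters/digits/symbols, collapse whitespace, cap the word count."""
    words = _STRIP.sub("", reply).split()
    if len(words) > max_words:
        # Close a truncated reply with a Bangla full stop (danda).
        return " ".join(words[:max_words]) + "।"
    return " ".join(words)


print(clean_reply("আমি ok ভালো আছি 123!"))
```

Inside the notebook, `reply = clean_reply(reply)` would then stand in for the inline regex and truncation lines.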


In [28]:
uploaded = files.upload()  # Upload a Bangla .wav or .mp3 file
# files.upload() returns a dict keyed by the saved filename; use the first entry
audio_path = next(iter(uploaded))

reply, audio_out = full_conversation(audio_path)
display(audio_out)


Saving Recording (2).mp3 to Recording (2) (1).mp3
🎙 Transcribing...
📝 Raw transcription: তুমি নিজের সম্পর্কে কিছু বল, তুমি কে কি করে।
✅ Corrected text: তুমি নিজের সম্পর্কে কিছু বলো ।
🤖 Generating AI reply...

💬 সহায়কের সংক্ষিপ্ত উত্তর: আমি একটি কম্পিউটার প্রোগ্রাম, তবে আমি মানুষের মতো করে বাংলা বলতে ও উত্তর দিতে চেষ্টা করি। তোমাদের যেকোনো বিষয়ে সাহায্য করতে আমি সবসময় প্রস্তুত।
🔊 Generating speech...
✅ Saved: /content/bangla_reply_20251006_190810.mp3
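The run above chains four stages: transcription, grammar correction, reply generation, and speech synthesis. To exercise that flow without a GPU or network access, each stage can be passed in as a callable and replaced by a stub. This is a sketch under that assumption; `converse` and its parameter names are hypothetical and do not appear in the notebook.

```python
from typing import Callable


def converse(
    audio_path: str,
    transcribe: Callable[[str], str],
    correct: Callable[[str], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], str],
) -> tuple[str, str]:
    """Chain the four stages and return (reply_text, speech_path)."""
    raw_text = transcribe(audio_path)   # speech to text
    fixed_text = correct(raw_text)      # grammar correction
    reply = generate(fixed_text)        # LLM reply
    return reply, synthesize(reply)     # text to speech, returns a file path


# Stand-in stages so the flow can be checked in isolation.
reply, path = converse(
    "sample.wav",
    transcribe=lambda p: "raw text",
    correct=lambda t: t.upper(),
    generate=lambda t: f"reply to {t}",
    synthesize=lambda r: "/tmp/reply.mp3",
)
print(reply, path)
```

In the notebook, the real stages could be supplied as `lambda p: asr(p)["text"]`, `lambda t: ser(t)[0]["generated_text"]`, a wrapper around `gemma_pipe`, and `lambda r: run_async(bangla_tts(r))`.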
