<a href="https://colab.research.google.com/github/satyam-52/speech-to-text/blob/main/speech_to_text_whisper_hindi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers torch torchaudio accelerate googletrans==4.0.0rc1

Collecting googletrans==4.0.0rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0rc1)
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0rc1)
  Downloading idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0rc1)
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0rc1)
  Downloading httpcore-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting h11<0.10,>=0.8 (from httpcore==0.9.*->httpx==0.13.3->googletrans=

In [2]:
import torch
from transformers import pipeline
from google.colab import files
from googletrans import Translator
import IPython.display as ipd
import os

In [3]:
def initialize_models():
    """Initialize transcription model and Google Translator"""
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Transcription model
    print("Loading Whisper Hindi Large-v2 model...")
    transcribe = pipeline(
        task="automatic-speech-recognition",
        model="vasista22/whisper-hindi-large-v2",
        chunk_length_s=30,
        device=device
    )

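    # Pin decoding to Hindi transcription so Whisper neither auto-detects
    # the language nor switches to its built-in translate task.
    transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(
        language="hi",
        task="transcribe"
    )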

    # Google Translator
    print("Initializing Google Translator...")
    translator = Translator()

    print("Models loaded successfully!")
    return transcribe, translator

In [4]:
def transcribe_and_translate_google(audio_path, transcribe_model, translator):
    """
    Transcribe Hindi audio and translate to English using Google Translate
    """
    try:
        # Transcribe
        print("Transcribing audio...")
        result = transcribe_model(audio_path)
        hindi_text = result["text"]

        # Translate using Google Translate
        print("Translating to English using Google Translate...")
        translation = translator.translate(hindi_text, src='hi', dest='en')
        english_text = translation.text

        return hindi_text, english_text

    except Exception as e:
        print(f"Error: {e}")
        return None, None
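
The function above catches every exception and returns `(None, None)`, so a transient network failure in the translation step looks the same as a permanent one. Since `googletrans` is an unofficial wrapper around a web endpoint, occasional failures are plausible. A minimal sketch of a retry helper follows; `call_with_retries` and its parameters are hypothetical names, not part of the notebook or of any library.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    Waits base_delay, then 2*base_delay, 4*base_delay, ... between tries.
    `sleep` is injectable so the helper can be tested without real waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # broad on purpose: hypothetical sketch
            last_error = error
            # Only wait if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Inside `transcribe_and_translate_google`, the translation call could then be written as `call_with_retries(translator.translate, hindi_text, src='hi', dest='en')`, assuming a retry is actually appropriate for the failure being seen.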

In [5]:
# Test the translation with your example
def test_translation():
    translator = Translator()
    test_text = "मेरा नाम सत्यम है"
    result = translator.translate(test_text, src='hi', dest='en')
    print(f"Test Hindi: {test_text}")
    print(f"Test English: {result.text}")

In [6]:
# Run test first
print("Testing translation accuracy:")
test_translation()

# Initialize models
transcriber, translator = initialize_models()

# Upload and process audio
print("\nPlease upload your Hindi audio file:")
uploaded = files.upload()
audio_file = list(uploaded.keys())[0]

# Process the audio
hindi_transcription, english_translation = transcribe_and_translate_google(
    audio_file, transcriber, translator
)

if hindi_transcription and english_translation:
    print("\n" + "="*60)
    print("RESULTS:")
    print("="*60)
    print("📝 HINDI TRANSCRIPTION:")
    print(hindi_transcription)
    print("\n🔄 ENGLISH TRANSLATION (Google Translate):")
    print(english_translation)
    print("="*60)

    # Save and download only on success; writing None would raise a TypeError.
    # Save Hindi transcription
    hindi_filename = f"{audio_file}_hindi_transcription.txt"
    with open(hindi_filename, 'w', encoding='utf-8') as f:
        f.write(hindi_transcription)

    # Save English translation
    english_filename = f"{audio_file}_english_translation.txt"
    with open(english_filename, 'w', encoding='utf-8') as f:
        f.write(english_translation)

    # Save combined results
    combined_filename = f"{audio_file}_transcription_and_translation.txt"
    with open(combined_filename, 'w', encoding='utf-8') as f:
        f.write("HINDI TRANSCRIPTION:\n")
        f.write("=" * 30 + "\n")
        f.write(hindi_transcription + "\n\n")
        f.write("ENGLISH TRANSLATION:\n")
        f.write("=" * 30 + "\n")
        f.write(english_translation + "\n")

    print("\nFiles saved:")
    print(f"- Hindi transcription: {hindi_filename}")
    print(f"- English translation: {english_filename}")
    print(f"- Combined file: {combined_filename}")

    # Download the files
    files.download(hindi_filename)
    files.download(english_filename)
    files.download(combined_filename)
else:
    print("Processing failed!")


Testing translation accuracy:
Test Hindi: मेरा नाम सत्यम है
Test English: My name is Satyam
Using device: cuda:0
Loading Whisper Hindi Large-v2 model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/6.17G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/6.17G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/300 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/832 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

Device set to use cuda:0


Initializing Google Translator...
Models loaded successfully!

Please upload your Hindi audio file:


Saving WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg to WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg
Transcribing audio...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Translating to English using Google Translate...

RESULTS:
📝 HINDI TRANSCRIPTION:
मेरा नाम सत्यम है मैं लखनऊ में रहता हूँ

🔄 ENGLISH TRANSLATION (Google Translate):
My name is Satyam, I live in Lucknow

Files saved:
- Hindi transcription: WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg_hindi_transcription.txt
- English translation: WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg_english_translation.txt
- Combined file: WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg_transcription_and_translation.txt
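
The saved filenames above keep the audio extension in the middle of the name (`….ogg_hindi_transcription.txt`) because the suffix is appended to the full upload name. A minimal sketch of deriving names from the stem instead is shown below; `output_paths` is a hypothetical helper and is not what produced the output above.

```python
from pathlib import Path


def output_paths(audio_path):
    """Build output filenames from the audio filename without its extension."""
    audio = Path(audio_path)
    stem = audio.stem  # filename with the final suffix (e.g. ".ogg") removed
    return {
        "hindi": str(audio.with_name(f"{stem}_hindi_transcription.txt")),
        "english": str(audio.with_name(f"{stem}_english_translation.txt")),
        "combined": str(
            audio.with_name(f"{stem}_transcription_and_translation.txt")
        ),
    }


paths = output_paths("WhatsApp Ptt 2025-05-25 at 5.09.49 PM.ogg")
print(paths["hindi"])
# → WhatsApp Ptt 2025-05-25 at 5.09.49 PM_hindi_transcription.txt
```

Only the final suffix is stripped, so the dots inside the timestamp (`5.09.49`) are left untouched.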

