### Install Required Libraries

This cell installs all the necessary Python libraries for:
- Audio processing (`ffmpeg-python`)
- Transcription (`openai-whisper`)
- Text-to-speech synthesis (`TTS`)
- Evaluation metrics (`jiwer` for WER/CER, `pesq` for audio quality)

These tools will be used throughout the notebook for transcription, translation, audio generation, and evaluation.

In [None]:
!pip install ffmpeg-python
!pip install git+https://github.com/openai/whisper.git
!pip install TTS
!pip install jiwer pesq

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-puqu5h23
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-puqu5h23
  Resolved https://github.com/openai/whisper.git to commit 517a43ecd132a2089d85f4ebc044728a71d49f6e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20240930)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==202409

### Mount Google Drive

Mount Google Drive to access the lecture video and audio files and to save intermediate outputs such as transcriptions, translated text, and synthesized audio.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Extract Audio from Lecture Video

Use `ffmpeg` to extract the audio stream from the lecture video file (`Lecture.mp4`).  
The output is saved as a WAV file and used for transcription in the next step.

In [None]:
import ffmpeg

video_file_path = '/content/drive/MyDrive/Speech Q1/Lecture.mp4'
output_audio_path = '/content/drive/MyDrive/Speech Q1/extracted_audio.wav'

try:
    (
        ffmpeg
        .input(video_file_path)
        .output(output_audio_path, acodec='pcm_s16le')  # uncompressed 16-bit PCM, so the .wav file is a true WAV
        .overwrite_output()
        .run(capture_stdout=True, capture_stderr=True)
    )
    print(f"Audio extracted and saved to: {output_audio_path}")
except ffmpeg.Error as e:
    print(f"An error occurred: {e.stderr.decode()}")

Audio extracted and saved to: /content/drive/MyDrive/Speech Q1/extracted_audio.wav
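
A quick way to confirm that a `.wav` file really holds uncompressed PCM audio is to read its header with Python's standard `wave` module, which raises `wave.Error` when the payload is a compressed format such as MP3. The sketch below is only an illustration; the helper name `describe_wav` is hypothetical and not part of the notebook.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, n_frames) of a PCM WAV file.

    wave.open raises wave.Error if the file is not uncompressed PCM.
    """
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels(),
            wf.getsampwidth(),
            wf.getframerate(),
            wf.getnframes(),
        )


# Hypothetical usage on the extracted lecture audio:
# print(describe_wav('/content/drive/MyDrive/Speech Q1/extracted_audio.wav'))
```

A sample width of 2 bytes corresponds to 16-bit PCM.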


### Transcribe Lecture Using Whisper

Use Whisper's `large` model to transcribe the audio extracted from the lecture.  
Whisper is multilingual, but when no language is specified it detects a single language from the opening audio, so **code-switched** English-Hindi passages may be transcribed less reliably than monolingual speech.  
The output is saved to `transcription.txt` for later processing.


In [None]:
import whisper
import torch

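# Pick the GPU when available and load the model directly onto that device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large", device=device)

audio_file_path = '/content/drive/MyDrive/Speech Q1/extracted_audio.wav'

# No language is passed, so Whisper auto-detects it
result = model.transcribe(audio_file_path)

print("Transcription:", result['text'])

# Write as UTF-8 so any non-ASCII text in the transcript is saved reliably
with open('/content/drive/MyDrive/Speech Q1/transcription.txt', 'w', encoding='utf-8') as f:
    f.write(result['text'])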

Transcription:  We have been talking about this audio, audio processing with respect to speaker recognition, speech recognition, or any such related task. But you also know that with any of these security technologies, there can be attacks or there can be people who have any kind of ill intention who would like to defraud the system, who would like to fool the system. Have you heard of any such examples, any such real world examples where any kind of security system is in place, be it biometrics, face? Voice, any of those. And there have been cases where these systems have been fooled. Anybody remembers any such instance? Would like to share. Someone, you have to speak up. Is he going to answer? OK, all right. So there have been several such instances, not only. Only in I'm audible, right? Somebody, please speak up. I'm not able to hear you guys to see the text and all. OK, so there have been several such instances with respect to different kinds of. Tasks, automation tasks that we hav

### Remove Filler Words from Transcription

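Clean the transcribed text by removing common filler words (e.g., "um", "uh", "like", "you know") with a regular expression.  
This shortens the text for translation and synthesis, but the pattern is broad: words such as "like", "so", "right", "just", and "well" are removed even where they carry meaning, and punctuation that surrounded a removed word is left behind.  
The cleaned output is saved as `cleaned_transcription.txt`.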


In [None]:
import re

def remove_filler_words(text):
    if not text:
        return ""

    filler_pattern = r'\b(um+|uh+|like|you know|basically|actually|so+|right|okay|I mean|just|well|hmm+)\b'

    cleaned = re.sub(filler_pattern, '', text, flags=re.IGNORECASE)

    cleaned = re.sub(r'\s+', ' ', cleaned).strip()

    return cleaned

with open('/content/drive/MyDrive/Speech Q1/transcription.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

cleaned_text = remove_filler_words(raw_text)

cleaned_path = '/content/drive/MyDrive/Speech Q1/cleaned_transcription.txt'
with open(cleaned_path, 'w', encoding='utf-8') as f:
    f.write(cleaned_text)

print(f" Cleaned transcription saved to: {cleaned_path}")

 Cleaned transcription saved to: /content/drive/MyDrive/Speech Q1/cleaned_transcription.txt
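
To make the effect of the filler pattern concrete, the sketch below applies the same regular expression to a made-up sentence and compares it with a narrower variant that strips only hesitation sounds. The function names and the narrower pattern are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

# Same pattern as in the cell above
BROAD_PATTERN = r'\b(um+|uh+|like|you know|basically|actually|so+|right|okay|I mean|just|well|hmm+)\b'

# Hypothetical narrower pattern: hesitation sounds only, plus a trailing comma and spaces
HESITATION_PATTERN = r'\b(?:um+|uh+|hmm+)\b,?\s*'


def strip_fillers_broad(text):
    """Replicates the notebook's cleaning: remove fillers, then collapse whitespace."""
    cleaned = re.sub(BROAD_PATTERN, '', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', cleaned).strip()


def strip_hesitations_only(text):
    """Illustrative alternative that leaves content words such as 'like' intact."""
    cleaned = re.sub(HESITATION_PATTERN, '', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', cleaned).strip()


sample = "Um, I would like to test this, right?"
print(strip_fillers_broad(sample))     # , I would to test this, ?
print(strip_hesitations_only(sample))  # I would like to test this, right?
```

The broad pattern drops the verb "like" and leaves orphaned punctuation, which is worth keeping in mind when reading the translated output.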


### Translate Cleaned Transcription to Bengali

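Use the `facebook/nllb-200-distilled-600M` multilingual translation model to translate the cleaned English-Hindi transcription into **Bengali**, a low-resource language.  
This first attempt passes the whole transcript in a single call. As the warnings in the output show, the input (4605 tokens) exceeds the model's 1024-token limit, so the result saved to `bengali_translated.txt` is superseded by the chunked translation in the next section.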

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_name = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

translator = pipeline("translation", model=model, tokenizer=tokenizer, src_lang="eng_Latn", tgt_lang="ben_Beng", max_length=512)

with open('/content/drive/MyDrive/Speech Q1/cleaned_transcription.txt', 'r', encoding='utf-8') as f:
    english_text = f.read()

translated = translator(english_text)
bengali_text = translated[0]['translation_text']

bengali_path = '/content/drive/MyDrive/Speech Q1/bengali_translated.txt'
with open(bengali_path, 'w', encoding='utf-8') as f:
    f.write(bengali_text)

print(f" Bengali translation saved to: {bengali_path}")

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (4605 > 1024). Running this sequence through the model will result in indexing errors
Your input_length: 4605 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


 Bengali translation saved to: /content/drive/MyDrive/Speech Q1/bengali_translated.txt


### Chunk and Translate Cleaned Text to Bengali (Final Translation)

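Split the cleaned English transcription into sentence-based chunks of under about 450 characters each so that every call stays within the translation model's input limit.  
The character count is only a proxy for the token count, and a single sentence longer than the budget still forms one oversized chunk, which is the likely cause of the `input_length: 883` warning in the output.  
Each chunk is translated with the `nllb-200` model, and the results are joined to form the final Bengali version, saved as `bengali_translated_final.txt`.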

In [None]:
import re
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
translator = pipeline("translation", model=model, tokenizer=tokenizer, src_lang="eng_Latn", tgt_lang="ben_Beng", max_length=512)

with open('/content/drive/MyDrive/Speech Q1/cleaned_transcription.txt', 'r', encoding='utf-8') as f:
    english_text = f.read()

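# Split after sentence-ending punctuation
sentences = re.split(r'(?<=[.?!])\s+', english_text)
chunks = []
temp_chunk = ""

for sentence in sentences:
    # 450 characters is a rough proxy for the model's token limit
    if len(temp_chunk) + len(sentence) < 450:
        temp_chunk += sentence + " "
    else:
        # Guard against appending an empty chunk when the first sentence is already too long
        if temp_chunk:
            chunks.append(temp_chunk.strip())
        temp_chunk = sentence + " "
if temp_chunk:
    chunks.append(temp_chunk.strip())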

bengali_chunks = []
for i, chunk in enumerate(chunks):
    try:
        result = translator(chunk)
        bengali_chunks.append(result[0]['translation_text'])
        print(f" Translated chunk {i+1}/{len(chunks)}")
    except Exception as e:
        print(f" Error in chunk {i+1}: {e}")
        bengali_chunks.append("")

final_bengali_text = "\n\n".join(bengali_chunks)

bengali_path = "/content/drive/MyDrive/Speech Q1/bengali_translated_final.txt"
with open(bengali_path, "w", encoding="utf-8") as f:
    f.write(final_bengali_text)

print(f"\n Final Bengali translation saved to:\n{bengali_path}")

Device set to use cuda:0


 Translated chunk 1/36
 Translated chunk 2/36
 Translated chunk 3/36
 Translated chunk 4/36


Your input_length: 883 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)


 Translated chunk 5/36
 Translated chunk 6/36
 Translated chunk 7/36
 Translated chunk 8/36
 Translated chunk 9/36
 Translated chunk 10/36
 Translated chunk 11/36
 Translated chunk 12/36
 Translated chunk 13/36
 Translated chunk 14/36
 Translated chunk 15/36
 Translated chunk 16/36
 Translated chunk 17/36
 Translated chunk 18/36
 Translated chunk 19/36
 Translated chunk 20/36
 Translated chunk 21/36
 Translated chunk 22/36
 Translated chunk 23/36
 Translated chunk 24/36
 Translated chunk 25/36
 Translated chunk 26/36
 Translated chunk 27/36
 Translated chunk 28/36
 Translated chunk 29/36
 Translated chunk 30/36
 Translated chunk 31/36
 Translated chunk 32/36
 Translated chunk 33/36
 Translated chunk 34/36
 Translated chunk 35/36
 Translated chunk 36/36

 Final Bengali translation saved to:
/content/drive/MyDrive/Speech Q1/bengali_translated_final.txt
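
Since the translation model's limit is expressed in tokens rather than characters, one possible refinement is to measure each candidate chunk with a counting function. The sketch below is a minimal illustration under stated assumptions: `chunk_by_budget` and its `count_units` parameter are hypothetical names, and the default counter (whitespace-separated words) is only a stand-in for a real tokenizer.

```python
import re


def chunk_by_budget(text, max_units, count_units=lambda s: len(s.split())):
    """Group sentences into chunks whose measured size stays within max_units.

    count_units is a placeholder for a tokenizer-based counter; by default it
    counts whitespace-separated words. A single sentence that exceeds the
    budget on its own is still emitted as one oversized chunk.
    """
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and count_units(candidate) > max_units:
            # Adding this sentence would exceed the budget: close the chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_budget("One two. Three four five. Six.", max_units=4))
# ['One two.', 'Three four five. Six.']

# Hypothetical use with the notebook's tokenizer (not executed here):
# chunks = chunk_by_budget(english_text, 400,
#                          count_units=lambda s: len(tokenizer(s)["input_ids"]))
```

Splitting oversized sentences further (for example on commas) would be a separate step that this sketch does not attempt.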


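### List Available Speakers for Voice Cloning

Print the list of speaker identities bundled with the loaded TTS model.  
This helps in selecting a valid speaker ID (if using a multi-speaker model) for speech synthesis.  
The cell relies on the `tts` object created in "Load YourTTS Model for Voice Cloning" below, so run that cell first.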


In [None]:
print("Available speakers:", tts.speakers)

Available speakers: ['female-en-5', 'female-en-5\n', 'female-pt-4\n', 'male-en-2', 'male-en-2\n', 'male-pt-3\n']
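
The list printed above contains near-duplicates that differ only by a trailing newline. A small helper such as the hypothetical `clean_speaker_names` below can produce a readable list for display. Whether the model expects the raw key (with the newline) or the stripped one when a speaker is selected is not established here, so the original strings should be kept for lookups.

```python
def clean_speaker_names(speakers):
    """Strip surrounding whitespace and drop duplicates, preserving first-seen order."""
    seen, cleaned = set(), []
    for name in speakers:
        key = name.strip()
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


raw = ['female-en-5', 'female-en-5\n', 'female-pt-4\n',
       'male-en-2', 'male-en-2\n', 'male-pt-3\n']
print(clean_speaker_names(raw))
# ['female-en-5', 'female-pt-4', 'male-en-2', 'male-pt-3']
```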


### Generate Bengali Audio Using gTTS

Use `gTTS` (Google Text-to-Speech) to synthesize Bengali speech from the translated lecture text.  
`gTTS` supports Bengali (`lang='bn'`) but offers only a generic voice, so the output is not personalized to the speaker.  
The resulting MP3 file is saved as `bengali_speech_gTTS.mp3`.

In [None]:
!pip install gTTS -q
from gtts import gTTS

with open("/content/drive/MyDrive/Speech Q1/bengali_translated_final.txt", "r", encoding="utf-8") as f:
    bengali_text = f.read()

# Use a separate name so the Coqui `tts` model object is not overwritten
bengali_tts = gTTS(text=bengali_text, lang='bn')
output_path = "/content/drive/MyDrive/Speech Q1/bengali_speech_gTTS.mp3"
bengali_tts.save(output_path)

print(f" Bengali audio saved to: {output_path}")

 Bengali audio saved to: /content/drive/MyDrive/Speech Q1/bengali_speech_gTTS.mp3


### Download and Convert Assamese Voice Sample (My Voice)

Download the recorded voice sample (`my_voice_sample.mp3`) from Google Drive using `gdown`,  
and convert it with `ffmpeg` to a WAV file (`my_voice_sample.wav`) in 22,050 Hz, mono, 16-bit PCM format.  
This WAV file will be used as the reference voice for cloning with `YourTTS`.

In [None]:
!gdown "https://drive.google.com/uc?id=1hNpZq7SAOtROtjv-fcc3aQHH17tXDNpd" -O my_voice_sample.mp3
!ffmpeg -i my_voice_sample.mp3 -ar 22050 -ac 1 -c:a pcm_s16le my_voice_sample.wav

Downloading...
From: https://drive.google.com/uc?id=1hNpZq7SAOtROtjv-fcc3aQHH17tXDNpd
To: /content/my_voice_sample.mp3
100% 2.10M/2.10M [00:00<00:00, 163MB/s]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable

### Load YourTTS Model for Voice Cloning

Load the `your_tts` multilingual, multi-speaker TTS model using the Coqui TTS API.  
This model allows speaker adaptation from a reference voice sample, enabling personalized speech synthesis in different languages.


In [None]:
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", gpu=True)



 > tts_models/multilingual/multi-dataset/your_tts is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:64
 | > log_func:np.log10
 | > min_level_db:-

### Define Assamese Lecture Text for Synthesis

This is an **Assamese** translation of the opening segment of the lecture,  
based on the original English-Hindi content from the transcription.  
This text will be used to generate speech using the speaker embedding from the recorded voice sample.


In [None]:
assamese_text = (
    "আমি এই অডিঅ' প্ৰচেছিং, স্পীকাৰ চিনাক্তকৰণ, স্পীচ চিনাক্তকৰণ আৰু এনেধৰণৰ অন্যান্য টাস্কৰ বিষয়ে কথা পাতি আছো। "
    "কিন্তু আপোনালোকে জানেই যে এইবোৰ সিকিউৰিটি টেকন’লজীত কেতিয়াবা আক্ৰমণ হ’ব পাৰে, বা কিছুমান মানুহ থাকিব পাৰে "
    "যি ক্ৰিয়াকলাপক ঠগিবলৈ চেষ্টা কৰিব, প্ৰণালীটো বেয়া কৰাৰ চেষ্টা কৰিব। "
    "আপোনালোকে এনেধৰণৰ উদাহৰণ শুনিছে নে? এনে বাস্তৱ উদাহৰণ, য’ত কোনো সিকিউৰিটি ছিষ্টেম ব্যৱহাৰ হৈছে — বায়’মেট্ৰিক্স, মুখ, কণ্ঠ — এনে কিবা এটা। "
    "আৰু এনে বহু ঘটনা হৈছে য’ত এই ছিষ্টেমবোৰ ঠগা হৈছে। কোনোৱে এনে উদাহৰণ মনত ৰাখে নে? ভাগ দিয়ক। "
    "কোনোৱে, আপুনি ক’ব লাগিব। তেওঁ উত্তৰ দিব নে? ঠিক আছে। এনে বহু ঘটনা আছে। কেৱল মই শুনা গৈছো বুলি নহয়। "
    "মই শুনিছোঁ নে? কোনোৱে, অনুগ্ৰহ কৰি ক’ব।"
)

### Attempt Voice Cloning in Assamese Using YourTTS (Script Unsupported)

Use `YourTTS` to synthesize Assamese audio using a speaker embedding from a real recorded voice sample.  
The model has no Assamese option, so the call passes `language="en"`. Although the model accepts the speaker WAV file, its character set does **not** cover the Assamese script, causing most characters to be discarded.  
The audio file `your_voice_assamese_output.wav` was generated, but it is likely incomplete or incorrect because of this script incompatibility.

This experiment demonstrates the model's limitation in handling Indian languages written in their native scripts.

In [None]:
tts.tts_to_file(
    text=assamese_text,
    speaker_wav="my_voice_sample.wav",
    language="en",
    file_path="your_voice_assamese_output.wav"
)

 > Text splitted to sentences.
["আমি এই অডিঅ' প্ৰচেছিং, স্পীকাৰ চিনাক্তকৰণ, স্পীচ চিনাক্তকৰণ আৰু এনেধৰণৰ অন্যান্য টাস্কৰ বিষয়ে কথা পাতি আছো। কিন্তু আপোনালোকে জানেই যে এইবোৰ সিকিউৰিটি টেকন’লজীত কেতিয়াবা আক্ৰমণ হ’ব পাৰে, বা কিছুমান মানুহ থাকিব পাৰে যি ক্ৰিয়াকলাপক ঠগিবলৈ চেষ্টা কৰিব, প্ৰণালীটো বেয়া কৰাৰ চেষ্টা কৰিব। আপোনালোকে এনেধৰণৰ উদাহৰণ শুনিছে নে?", 'এনে বাস্তৱ উদাহৰণ, য’ত কোনো সিকিউৰিটি ছিষ্টেম ব্যৱহাৰ হৈছে — বায়’মেট্ৰিক্স, মুখ, কণ্ঠ — এনে কিবা এটা। আৰু এনে বহু ঘটনা হৈছে য’ত এই ছিষ্টেমবোৰ ঠগা হৈছে। কোনোৱে এনে উদাহৰণ মনত ৰাখে নে?', 'ভাগ দিয়ক। কোনোৱে, আপুনি ক’ব লাগিব। তেওঁ উত্তৰ দিব নে?', 'ঠিক আছে। এনে বহু ঘটনা আছে। কেৱল মই শুনা গৈছো বুলি নহয়। মই শুনিছোঁ নে?', 'কোনোৱে, অনুগ্ৰহ কৰি ক’ব।']
আমি এই অডিঅ' প্ৰচেছিং, স্পীকাৰ চিনাক্তকৰণ, স্পীচ চিনাক্তকৰণ আৰু এনেধৰণৰ অন্যান্য টাস্কৰ বিষয়ে কথা পাতি আছো। কিন্তু আপোনালোকে জানেই যে এইবোৰ সিকিউৰিটি টেকন’লজীত কেতিয়াবা আক্ৰমণ হ’ব পাৰে, বা কিছুমান মানুহ থাকিব পাৰে যি ক্ৰিয়াকলাপক ঠগিবলৈ চেষ্টা কৰিব, প্ৰণালীটো বেয়া কৰাৰ চেষ্টা কৰিব। আপোনালোকে এনেধৰণৰ উদা

'your_voice_assamese_output.wav'
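
One way to gauge how much of a text a character-based model could even see is to compare the text against the model's character set. The sketch below assumes a purely Latin character set as a stand-in; the real YourTTS vocabulary is not inspected here, and `unsupported_ratio` and `LATIN_VOCAB` are hypothetical names.

```python
import string

# Assumed stand-in for a Latin-script model vocabulary (not the actual YourTTS set)
LATIN_VOCAB = set(string.ascii_letters + string.punctuation)


def unsupported_ratio(text, supported_chars):
    """Return the fraction of non-space characters in text that are not in supported_chars."""
    relevant = [ch for ch in text if not ch.isspace()]
    if not relevant:
        return 0.0
    missing = [ch for ch in relevant if ch not in supported_chars]
    return len(missing) / len(relevant)


print(unsupported_ratio("hello, world", LATIN_VOCAB))  # 0.0
print(unsupported_ratio("ঠিক আছে।", LATIN_VOCAB))       # 1.0
```

Under this assumption, every character of an Assamese sentence falls outside the vocabulary, which is consistent with the incomplete audio described above.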

### Audio Preprocessing: Downsampling to 16kHz for PESQ Compatibility

The PESQ (Perceptual Evaluation of Speech Quality) metric supports only audio sampled at 8000 Hz (narrow-band) or 16000 Hz (wide-band). Because the audio files may have been recorded or synthesized at a different sampling rate (e.g., 22050 Hz or 48000 Hz), both are converted to 16 kHz mono using `ffmpeg`.

This ensures compatibility with the PESQ tool and gives both files the same format for evaluating speech quality.

In [None]:
!ffmpeg -i /content/my_voice_sample.wav -ar 16000 -ac 1 /content/my_voice_sample_16k.wav
!ffmpeg -i /content/your_voice_assamese_output.wav -ar 16000 -ac 1 /content/your_voice_assamese_output_16k.wav

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab

### Evaluation: PESQ and Estimated MOS for Audio Quality
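This section computes the PESQ (Perceptual Evaluation of Speech Quality) score between the original recorded voice and the output synthesized by the YourTTS model. PESQ is standardized by ITU-T (Recommendation P.862) and is a full-reference metric: it assumes the degraded signal is a processed copy of the same utterance as the reference. Here the two recordings contain different speech, so the score is at best a rough indication rather than a valid quality measurement.

We also heuristically estimate the MOS (Mean Opinion Score) from the PESQ score using a simple threshold table, as a coarse proxy for the human-perceived quality of the generated speech.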

In [None]:
from scipy.io import wavfile
from pesq import pesq

ref_path = "/content/my_voice_sample_16k.wav"
deg_path = "/content/your_voice_assamese_output_16k.wav"

rate_ref, ref = wavfile.read(ref_path)
rate_deg, deg = wavfile.read(deg_path)

assert rate_ref == rate_deg == 16000, "PESQ requires 16 kHz or 8 kHz audio."

pesq_score = pesq(rate_ref, ref, deg, 'wb')
print(f"PESQ Score: {pesq_score:.2f}")

def pesq_to_mos(score):
    """Map a PESQ score to a coarse MOS estimate (ad hoc thresholds, not a standard mapping)."""
    thresholds = [(4.5, 5.0), (4.0, 4.5), (3.6, 4.0), (3.1, 3.5), (2.6, 3.0), (2.1, 2.5)]
    for lower_bound, mos_value in thresholds:
        if score >= lower_bound:
            return mos_value
    return 2.0

mos = pesq_to_mos(pesq_score)
print(f"Estimated MOS Score: {mos:.2f}")

PESQ Score: 2.96
Estimated MOS Score: 3.00


### Evaluate Transcription Accuracy (WER & CER)

Calculate the **Word Error Rate (WER)** and **Character Error Rate (CER)** between two versions of the transcript.  
The reference is the original Whisper output, and the hypothesis is the cleaned version after removing filler words.  
Because no human-verified transcript is used, these scores quantify how much the filler-word removal changed the text, not how accurate Whisper's transcription is.

In [None]:
from jiwer import wer, cer

with open("/content/drive/MyDrive/Speech Q1/transcription.txt", "r", encoding="utf-8") as ref_f:
    reference = ref_f.read()

with open("/content/drive/MyDrive/Speech Q1/cleaned_transcription.txt", "r", encoding="utf-8") as hyp_f:
    hypothesis = hyp_f.read()

wer_score = wer(reference, hypothesis)
cer_score = cer(reference, hypothesis)

print(f" Word Error Rate (WER): {wer_score:.4f}")
print(f" Character Error Rate (CER): {cer_score:.4f}")

 Word Error Rate (WER): 0.0851
 Character Error Rate (CER): 0.0639
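
For intuition about what the WER figure measures, the sketch below computes a word-level edit distance by hand and divides it by the number of reference words. It is an illustrative reimplementation under simple assumptions (whitespace tokenization, no text normalization), not the `jiwer` code, and the name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "so we have been talking about audio"
hypothesis = "we have been talking about audio"
print(f"{word_error_rate(reference, hypothesis):.4f}")  # 1 deletion / 7 words = 0.1429
```

Removing a single filler word from a seven-word sentence already yields a WER of about 0.14, which shows why filler removal alone produces a nonzero score.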
