In [4]:
from IPython.display import HTML

HTML('''
<iframe width="720" height="380" src="https://www.youtube.com/embed/xTt9PC4MPuQ" frameborder="0" allowfullscreen></iframe>
''')

**Problem Overview:**
Ghost results in speech-to-text systems are transcriptions that are inaccurate or fabricated and do not correspond to the words actually spoken. These errors reduce the accuracy and reliability of the transcription system and lead to poor user experiences, especially in high-stakes settings such as legal transcription, medical records, or customer service.
The challenge is to mitigate ghost results and thereby improve the overall performance of the speech-to-text system.

**Objective:**
To implement strategies that reduce ghost results, provide an explanation of the chosen methods, and demonstrate their effectiveness through examples or quantitative metrics.

**Causes of Ghost Results:**
Ghost results typically arise due to several key factors:

*  **Background Noise and Interference:** Unfiltered ambient sounds can confuse the speech-to-text system.
*  **Accents and Dialects:** The system might struggle to recognize regional accents or dialects, leading to incorrect transcription.
*  **Homophones and Word Ambiguity:** Words that sound similar can be easily confused.
*  **Acoustic Model Limitations:** Weak or under-trained models cannot interpret certain sounds correctly, especially if they have not been exposed to diverse datasets.

**Strategies for Counteracting Ghost Results:**

*  **Noise Filtering and Preprocessing:**
Use noise suppression techniques to lessen the effect of background sounds. These may include adaptive noise cancellation algorithms that suppress ambient noise while letting the dominant voice signal through.
Acoustic beamforming techniques can also focus the system on the primary speaker while reducing the influence of off-axis noise.

*  **Training with a Variety of Datasets:** Train the model on a large and diverse dataset that covers different dialects, accents, and background noise conditions, so that it is more robust to real-world variability. Use transfer learning from pre-trained models that have already been exposed to varied data.

*  **More Comprehensive Acoustic and Language Models**: Using finer-grained phonetic units in the acoustic model can keep accent-related mispronunciations from cascading into word-identification errors.
Additionally, context-aware language models can use the surrounding words to resolve homophones (words that are pronounced alike but have different meanings) and other ambiguities. Modern transformer-based language models (for example, BERT or GPT) capture this context well and help keep the words within a sentence coherent.

*  **Confidence Scoring and Post-processing:** Confidence scores estimate how likely a transcription is to be accurate. When the system is unsure about a transcription, it can flag the result for human verification or try to reconstruct the text from the surrounding context.
Auto-correction algorithms can cross-check transcriptions against language model predictions and automatically correct low-confidence words or phrases.

*  **Feedback and Adaptation in Real Time:** Feed corrections from users or speakers back into the system through real-time feedback loops, so that the model updates each time an error is corrected. Over time, this incremental learning improves system performance.
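
The confidence-scoring strategy listed above can be sketched in a few lines. This is a minimal illustration, not the method used later in this notebook: the `Segment` class, the `avg_logprob` field, the 0.6 threshold, and the example values are all hypothetical, and it assumes the decoder reports a mean token log-probability for each segment.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    text: str
    avg_logprob: float  # hypothetical: mean token log-probability from the decoder


def split_by_confidence(
    segments: List[Segment], threshold: float = 0.6
) -> Tuple[List[Segment], List[Segment]]:
    """Separate segments into accepted and flagged-for-review lists."""
    accepted, flagged = [], []
    for seg in segments:
        # exp(mean log-prob) is the geometric mean of the token probabilities
        confidence = math.exp(seg.avg_logprob)
        if confidence >= threshold:
            accepted.append(seg)
        else:
            flagged.append(seg)
    return accepted, flagged


# Made-up values for illustration only
segments = [
    Segment("Assigning confidence levels to transcription", -0.10),
    Segment("and something the speaker never said", -1.60),
]
accepted, flagged = split_by_confidence(segments)
print("Accepted:", [s.text for s in accepted])
print("Flagged for review:", [s.text for s in flagged])
```

Flagged segments could then be routed to human verification or re-decoded with more context, as described in the confidence-scoring bullet. The threshold would need tuning on real data.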

***In this approach, I used noise reduction on the audio file to reduce ghost results in the transcription.***

***transcription with noisy audio***

In [None]:
from google.colab import files
from IPython.display import Audio

# Step 1: Upload the audio file
uploaded = files.upload()

# Step 2: Take the name of the first uploaded file (here, "test_audio.aac")
audio_path = list(uploaded.keys())[0]

# Step 3: Play the audio
Audio(audio_path, autoplay=True)


Saving test_audio.aac to test_audio.aac


In [None]:
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load pre-trained model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Load audio file (returns the waveform and its original sampling rate)
audio_input, sample_rate = torchaudio.load('test_audio.aac')

# Resample the audio to the required sampling rate (Whisper models expect 16000 Hz)
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
audio_input = resampler(audio_input)

# Preprocess audio for the model
input_features = processor(audio_input.squeeze().numpy(), return_tensors="pt", sampling_rate=16000).input_features

# Generate transcription using the Whisper model
generated_ids = model.generate(input_features)

# Decode the transcription
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Initial Transcription: ", transcription)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Initial Transcription:   Assigning confidence levels to restrictions and filtering out low confidence segments.


***transcription after denoising the audio***
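
The denoising cell in this section passes a center frequency and a `Q` value to a band-pass biquad. As a rough guide to how those two numbers relate to a target band such as 300 to 3400 Hz, the sketch below uses the analog-prototype relation Q = f0 / bandwidth, with f0 taken as the geometric mean of the band edges. This is an approximation (a digital biquad deviates somewhat because of frequency warping), and the helper names are hypothetical.

```python
import math


def bandpass_params(low_hz: float, high_hz: float) -> tuple:
    """Center frequency and Q whose -3 dB edges approximate (low_hz, high_hz)."""
    center = math.sqrt(low_hz * high_hz)   # geometric mean of the band edges
    q = center / (high_hz - low_hz)        # Q = f0 / bandwidth
    return center, q


def approx_band_edges(center_hz: float, q: float) -> tuple:
    """Approximate -3 dB edges implied by a given center frequency and Q."""
    half_bw = center_hz / (2 * q)
    root = math.sqrt(half_bw ** 2 + center_hz ** 2)
    return root - half_bw, root + half_bw


center, q = bandpass_params(300, 3400)
print(f"For a 300-3400 Hz band: center ~ {center:.0f} Hz, Q ~ {q:.2f}")

low, high = approx_band_edges(1850, 1.0)
print(f"Center 1850 Hz with Q=1 passes roughly {low:.0f}-{high:.0f} Hz")
```

Under this approximation, a center of 1850 Hz with `Q=1` gives a narrower passband than 300 to 3400 Hz, so the filter shapes the spectrum gradually rather than cutting off sharply at those edges.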

In [None]:
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load pre-trained model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Load audio file
audio_input, sample_rate = torchaudio.load('test_audio.aac')

# Resample the audio to the required sampling rate (Whisper models expect 16000 Hz)
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
audio_input_resampled = resampler(audio_input)

# Band-pass filter aimed at the speech band (roughly 300-3400 Hz).
# Note: a single biquad with Q=1 rolls off gradually around its 1850 Hz center;
# it does not cut off sharply at 300 Hz and 3400 Hz.
low_freq = 300
high_freq = 3400
filtered_audio = torchaudio.functional.bandpass_biquad(
    audio_input_resampled,
    sample_rate=16000,
    central_freq=(low_freq + high_freq) / 2,
    Q=1,
)

# Apply gain (optional). With an amplitude gain of 1.0 the level is unchanged
# apart from clamping to [-1, 1]; raise the gain to boost low-volume audio.
gain_normalizer = torchaudio.transforms.Vol(1.0)
normalized_audio = gain_normalizer(filtered_audio)

# Preprocess audio for the model
input_features = processor(normalized_audio.squeeze().numpy(), return_tensors="pt", sampling_rate=16000).input_features

# Generate transcription using the Whisper model
generated_ids = model.generate(input_features)

# Decode the transcription
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Improved Transcription: ", transcription)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Improved Transcription:   Assigning confidence levels to transcription and filtering out low confidence segments.
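
To put a number on the difference between the two outputs, word error rate (WER) is a common metric: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python version. It assumes, for illustration only, that the denoised output matches what was actually spoken, since no ground-truth transcript is given here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumption: the denoised output is used as a stand-in reference
reference = "Assigning confidence levels to transcription and filtering out low confidence segments."
noisy_output = "Assigning confidence levels to restrictions and filtering out low confidence segments."

print(f"WER of the noisy transcription: {word_error_rate(reference, noisy_output):.3f}")
```

With a real ground-truth transcript, the same function can score both the noisy and the denoised outputs so that the improvement is measured rather than judged by eye.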
