In [None]:
!nvidia-smi

Mon May 13 11:12:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0              29W /  70W |    709MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 🌟 Dependencies Used:

1. **noisereduce**: A Python library for reducing noise in audio signals.
2. **openai-whisper**: OpenAI's open-source speech recognition system. It transcribes speech in many languages and can translate non-English speech into English.
3. **whisperx**: A Python package built on top of OpenAI's Whisper model that provides fast automatic speech recognition (ASR) with word-level timestamps and speaker diarization. It uses CTranslate2 for faster inference.
4. **pydub**: A high-level audio manipulation library with a simple interface.
5. **pysrt**: A library for parsing and editing SubRip (.srt) subtitle files.


In [None]:
!pip install noisereduce openai-whisper git+https://github.com/m-bain/whisperx.git git+https://github.com/openai/whisper.git pydub pysrt

Collecting git+https://github.com/m-bain/whisperx.git
  Cloning https://github.com/m-bain/whisperx.git to /tmp/pip-req-build-5o2ywfc0
  Running command git clone --filter=blob:none --quiet https://github.com/m-bain/whisperx.git /tmp/pip-req-build-5o2ywfc0
  Resolved https://github.com/m-bain/whisperx.git to commit f2da2f858e99e4211fe4f64b5f2938b007827e17
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-yqcb3_k_
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-yqcb3_k_
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### 🌟 Imports Used:
1. librosa: loads the MP3 audio file and can extract features if needed.
2. soundfile: saves the denoised audio after noise reduction.
3. noisereduce: applies noise reduction to clean up the audio.
4. whisper: transcribes the denoised audio to text.
5. pysrt: parses, edits, and saves subtitles in SubRip (.srt) format.
6. pydub: handles additional audio manipulation, such as slicing the audio into chunks.

In [None]:
import librosa
import soundfile as sf
import noisereduce as nr
import whisper
from pydub import AudioSegment
import pysrt
import os

### 🌟 Loading the YouTube video's audio as an MP3 file (converted from MP4)

In [None]:
mp3_path = '/content/drive/My Drive/aud.mp3'

In [None]:
# Load the noisy audio file
noisy_audio, sr = librosa.load(mp3_path, sr=None)

In [None]:
# Apply noise reduction
denoised_audio = nr.reduce_noise(y=noisy_audio, sr=sr)

In [None]:
# Save denoised audio
sf.write('/content/drive/My Drive/denoised_audio2.mp3', denoised_audio, sr)

### 🌟 Reasons to Use Whisper Base Model for Speech-to-Text
1. Robust Performance: Whisper models, including base, are trained on a large and diverse dataset of 680,000 hours of audio, which makes them robust to accents, background noise, and technical language.
2. Multilingual Support: The multilingual Whisper models support transcription in 99 languages, so a wide range of speech inputs can be handled.
3. Open-Source and Free: Whisper's code and model weights are openly released, unlike many commercial speech recognition services, which lowers cost and barriers to entry.

### 🌟 Advantages of Whisper Base Model
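1. Strong Generalization: Whisper performs well on unseen datasets without fine-tuning.
2. Multitasking Capabilities: A single model handles transcription, translation into English, and language identification.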

#### 🌟 The `transcribe()` function loads the audio file, preprocesses it, and passes it through the loaded Whisper model to generate a transcription.

In [None]:
model = whisper.load_model("base")
result = model.transcribe("/content/drive/My Drive/denoised_audio2.mp3")

In [None]:
with open("/content/drive/My Drive/aud_transcription.txt", "w") as f:
    f.write(result["text"])
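
Besides the full text, the dictionary returned by `transcribe()` also carries per-segment timestamps under `result["segments"]`, where each entry has `start`, `end`, and `text` keys. As a minimal sketch of how those segments could be written out as SubRip text, the helpers below use only the standard library. The names `format_srt_time` and `segments_to_srt` are hypothetical, and the sample segments are illustrative values rather than real model output.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


# Illustrative segments, shaped like the entries of result["segments"]
example_segments = [
    {"start": 0.0, "end": 2.09, "text": " Congratulations to you Mr. Raghavan for that."},
    {"start": 2.09, "end": 3.43, "text": " Thank you so much for joining us."},
]
print(segments_to_srt(example_segments))
```

With a real transcription, `segments_to_srt(result["segments"])` would give segment-level subtitles. These are coarser than the word-aligned output that whisperx produces.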

### 🌟 Command Explanation ( for Time-aligning)

This command runs the `whisperx` command-line tool on the denoised audio to produce a time-aligned transcription of the YouTube video, which the semantic chunking step uses as its subtitle input.

```bash
!whisperx /content/drive/MyDrive/denoised_audio.mp3 --model medium.en --output_dir . --align_model WAV2VEC2_ASR_LARGE_LV60K_960H
```

- **`!whisperx`**: Runs the `whisperx` command-line tool; the leading `!` tells the notebook to execute the line as a shell command.
- **`/content/drive/MyDrive/denoised_audio.mp3`**: Path to the input audio file, located in Google Drive.
- **`--model medium.en`**: Selects the Whisper model used for transcription, here the medium-sized English-only model.
- **`--output_dir .`**: Writes the output files to the current working directory.
- **`--align_model WAV2VEC2_ASR_LARGE_LV60K_960H`**: Selects the model used for forced alignment, here a wav2vec 2.0 ASR model, which gives the transcript more precise timestamps.
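
Outside a notebook, the same call can be assembled as an argument list. The sketch below is a minimal illustration using only the standard library. The helper names `build_whisperx_command` and `expected_srt_path` are hypothetical. The sketch also assumes that the `.srt` output is named after the input file's stem, which matches the `/content/denoised_audio.srt` path read later in this notebook but should be checked against the installed whisperx version.

```python
import shlex
from pathlib import Path


def build_whisperx_command(audio_path, model="medium.en", output_dir=".",
                           align_model="WAV2VEC2_ASR_LARGE_LV60K_960H"):
    """Assemble the argument list for the whisperx CLI call used in this notebook."""
    return [
        "whisperx", str(audio_path),
        "--model", model,
        "--output_dir", str(output_dir),
        "--align_model", align_model,
    ]


def expected_srt_path(audio_path, output_dir="."):
    """Assumption: the subtitle file is named <input stem>.srt inside output_dir."""
    return Path(output_dir) / (Path(audio_path).stem + ".srt")


audio = "/content/drive/MyDrive/denoised_audio.mp3"
cmd = build_whisperx_command(audio)
print(shlex.join(cmd))                       # the command as a single shell string
print(expected_srt_path(audio, "/content"))  # where the .srt is expected to appear
```

The list in `cmd` can be passed to `subprocess.run(cmd, check=True)` when running from a plain Python script instead of a notebook cell.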


In [None]:
!whisperx /content/drive/MyDrive/denoised_audio.mp3 --model medium.en --output_dir . --align_model WAV2VEC2_ASR_LARGE_LV60K_960H

2024-05-13 11:19:14.418725: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-13 11:19:14.418779: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-13 11:19:14.420180: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
  torchaudio.set_audio_backend("soundfile")
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.4. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might ha

## 🌟 Semantic Chunking Function

## Overview

The `semantic_chunking` function takes an audio file and its subtitle file and splits the audio into smaller segments, each aligned with its corresponding subtitle text. No chunk is longer than 15 seconds.

## Function Signature

```python
def semantic_chunking(audio_file, srt_file)
```

## Parameters:

- **`audio_file`** (str): Path to the input audio file.
- **`srt_file`** (str): Path to the input subtitle file in SRT format.
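
For reference, each SRT entry has a timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, and the chunking logic works with these times in milliseconds. pysrt does this parsing itself, so the following is only a minimal standard-library sketch of the conversion, with hypothetical helper names `srt_time_to_ms` and `parse_timing_line`.

```python
import re

# HH:MM:SS,mmm (a dot is also tolerated as the millisecond separator)
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_time_to_ms(timestamp: str) -> int:
    """Convert one SRT timestamp to an integer number of milliseconds."""
    match = _TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_timing_line(line: str):
    """Split an SRT timing line into (start_ms, end_ms)."""
    start, end = line.split("-->")
    return srt_time_to_ms(start), srt_time_to_ms(end)


print(parse_timing_line("00:00:02,090 --> 00:00:03,430"))  # (2090, 3430)
```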

## 🌟 Returns:

`chunks` (list of dicts): A list of dictionaries, one per chunk, each with the following keys:

- `chunk_id` (int): Unique identifier for the chunk.
- `chunk_length` (float): Length of the chunk in seconds.
- `text` (str): Textual content corresponding to the chunk.
- `start_time` (float): Start time of the chunk in seconds.
- `end_time` (float): End time of the chunk in seconds.

In [None]:
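# Function to perform semantic chunking and generate output in the specified format
def semantic_chunking(audio_file, srt_file):
    """Split audio into subtitle-aligned chunks of at most 15 seconds each.

    If a subtitle runs longer than 15 seconds, it yields several chunks
    that all carry the subtitle's full text.
    """
    # Load audio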
    audio = AudioSegment.from_file(audio_file)

    # Load subtitles
    subs = pysrt.open(srt_file)

    # Chunk duration in milliseconds (15 seconds)
    chunk_duration = 15 * 1000

    # Initialize start and end times
    start_time = 0
    end_time = chunk_duration

    chunks = []

    # Unique identifier for chunks
    chunk_id = 1

    # Segment audio and text
    for sub in subs:
        # End time of the current subtitle in milliseconds
        # (pysrt stores a SubRipTime's total milliseconds in `ordinal`)
        sub_end_time = sub.end.ordinal

        # Segment audio and text until end of subtitle
        while start_time < sub_end_time:
            # Ensure end time doesn't exceed end of subtitle
            if end_time > sub_end_time:
                end_time = sub_end_time

            # Extract audio chunk
            chunk_audio = audio[start_time:end_time]

            # Extract text chunk, joining multi-line subtitles with spaces
            chunk_text = ' '.join(sub.text.split('\n'))

            # Calculate chunk length in seconds
            chunk_length = len(chunk_audio) / 1000

            # Append chunk to list
            chunks.append({
                "chunk_id": chunk_id,
                "chunk_length": chunk_length,
                "text": chunk_text.strip(),
                "start_time": start_time / 1000,
                "end_time": end_time / 1000
            })

            # Move to next chunk
            start_time = end_time
            end_time += chunk_duration

            # Increment chunk id
            chunk_id += 1

    return chunks
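
Because every chunk ends at a subtitle boundary, short subtitles produce short chunks. If chunks closer to the 15-second limit are wanted, one option is to merge consecutive chunks afterwards. The sketch below is an assumed post-processing step, not part of the function above. `merge_chunks` is a hypothetical helper that greedily combines neighbouring chunks while the merged span stays within `max_length` seconds.

```python
def merge_chunks(chunks, max_length=15.0):
    """Greedily merge consecutive chunks so each merged span is at most max_length seconds.

    A single input chunk that is already longer than max_length is kept as is.
    """
    merged = []
    current = None
    for chunk in chunks:
        fits = (current is not None
                and chunk["end_time"] - current["start_time"] <= max_length)
        if fits:
            # Extend the current merged chunk with this chunk's text and end time
            current["text"] = f"{current['text']} {chunk['text']}".strip()
            current["end_time"] = chunk["end_time"]
        else:
            if current is not None:
                merged.append(current)
            current = {
                "text": chunk["text"],
                "start_time": chunk["start_time"],
                "end_time": chunk["end_time"],
            }
    if current is not None:
        merged.append(current)

    # Renumber the merged chunks and recompute their lengths
    return [
        {
            "chunk_id": index,
            "chunk_length": round(c["end_time"] - c["start_time"], 3),
            "text": c["text"],
            "start_time": c["start_time"],
            "end_time": c["end_time"],
        }
        for index, c in enumerate(merged, start=1)
    ]


# Illustrative input in the same format that semantic_chunking returns
sample = [
    {"chunk_id": 1, "chunk_length": 2.09, "text": "Congratulations to you Mr. Raghavan for that.", "start_time": 0.0, "end_time": 2.09},
    {"chunk_id": 2, "chunk_length": 1.34, "text": "Thank you so much for joining us.", "start_time": 2.09, "end_time": 3.43},
    {"chunk_id": 3, "chunk_length": 0.801, "text": "Over to you.", "start_time": 3.43, "end_time": 4.231},
    {"chunk_id": 4, "chunk_length": 5.101, "text": "Hi everybody.", "start_time": 4.231, "end_time": 9.332},
]
for merged_chunk in merge_chunks(sample):
    print(merged_chunk)
```

The merge builds new dictionaries, so the original `chunks` list is left unchanged and both granularities stay available.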

In [None]:
if __name__ == "__main__":
    # Paths to audio and srt files
    audio_file = "/content/drive/MyDrive/denoised_audio.mp3"
    srt_file = "/content/denoised_audio.srt"

    # Perform semantic chunking
    chunks = semantic_chunking(audio_file, srt_file)

    # Output chunks
    print(chunks)

[{'chunk_id': 1, 'chunk_length': 2.09, 'text': 'Congratulations to you Mr. Raghavan for that.', 'start_time': 0.0, 'end_time': 2.09}, {'chunk_id': 2, 'chunk_length': 1.34, 'text': 'Thank you so much for joining us.', 'start_time': 2.09, 'end_time': 3.43}, {'chunk_id': 3, 'chunk_length': 0.801, 'text': 'Over to you.', 'start_time': 3.43, 'end_time': 4.231}, {'chunk_id': 4, 'chunk_length': 5.101, 'text': 'Hi everybody.', 'start_time': 4.231, 'end_time': 9.332}, {'chunk_id': 5, 'chunk_length': 0.721, 'text': 'How are you?', 'start_time': 9.332, 'end_time': 10.053}, {'chunk_id': 6, 'chunk_length': 3.501, 'text': 'I am not hearing this at all.', 'start_time': 10.053, 'end_time': 13.554}, {'chunk_id': 7, 'chunk_length': 3.381, 'text': "It's like a post-lunch energy downer or something.", 'start_time': 13.554, 'end_time': 16.935}, {'chunk_id': 8, 'chunk_length': 1.781, 'text': "Let's hear it.", 'start_time': 16.935, 'end_time': 18.716}, {'chunk_id': 9, 'chunk_length': 1.841, 'text': 'Are you 

## 🌟 Chunks for the Whole YouTube Video, in Order (Pretty-Printed)

In [None]:
chunks

[{'chunk_id': 1,
  'chunk_length': 2.09,
  'text': 'Congratulations to you Mr. Raghavan for that.',
  'start_time': 0.0,
  'end_time': 2.09},
 {'chunk_id': 2,
  'chunk_length': 1.34,
  'text': 'Thank you so much for joining us.',
  'start_time': 2.09,
  'end_time': 3.43},
 {'chunk_id': 3,
  'chunk_length': 0.801,
  'text': 'Over to you.',
  'start_time': 3.43,
  'end_time': 4.231},
 {'chunk_id': 4,
  'chunk_length': 5.101,
  'text': 'Hi everybody.',
  'start_time': 4.231,
  'end_time': 9.332},
 {'chunk_id': 5,
  'chunk_length': 0.721,
  'text': 'How are you?',
  'start_time': 9.332,
  'end_time': 10.053},
 {'chunk_id': 6,
  'chunk_length': 3.501,
  'text': 'I am not hearing this at all.',
  'start_time': 10.053,
  'end_time': 13.554},
 {'chunk_id': 7,
  'chunk_length': 3.381,
  'text': "It's like a post-lunch energy downer or something.",
  'start_time': 13.554,
  'end_time': 16.935},
 {'chunk_id': 8,
  'chunk_length': 1.781,
  'text': "Let's hear it.",
  'start_time': 16.935,
  'end_t