

---





**Task: Semantic Chunking of a YouTube Video** 📹
- Dive into extracting meaningful audio-text pairs from a specific video. Show us your skill in achieving precise segmentation and alignment!




## Semantic Chunking of a YouTube Video

**Problem Statement:**

The objective is to extract high-quality, meaningful (semantic) segments from the specified YouTube video: [Watch Video](https://www.youtube.com/watch?v=Sby1uJ_NFIY).

Suggested workflow:
1. **Download Video and Extract Audio:** Download the video and separate the audio component.
2. **Transcription of Audio:** Utilize an open-source Speech-to-Text model to transcribe the audio. *Provide an explanation of the chosen model and any techniques used to enhance the quality of the transcription.*
3. **Time-Align Transcript with Audio:** *Describe the methodology and steps for aligning the transcript with the audio.*
4. **Semantic Chunking of Data:** Slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio-chunk being less than 15s in length. *Explain the logic used for semantic chunking and discuss the strengths and weaknesses of your approach.*

# Code Starts Here

# Detailed Explanations and Generalization of the Code

## Detailed Explanations
1. The code starts by defining `video2mp3`, which extracts the audio track from a video file by calling the `ffmpeg` command-line tool through `subprocess`. Despite its name, it writes a WAV file by default. It takes the video file path and the desired output extension as input and returns the path of the generated audio file.

2. The `download_video` function is defined to download a YouTube video using the `pytube` library. It takes the video URL as input, selects the highest resolution progressive stream, and downloads the video. It returns the path of the downloaded video file.

3. The `transcribe` function is defined to transcribe the audio file using the Whisper model. It loads the "large-v3" model, transcribes the audio, and returns the transcribed text.

4. The `align_transcript` function aligns the transcribed text with the audio using the `ctc-forced-aligner` library. It takes the audio file path and the transcript as input, writes the transcript to a temporary file, and runs the forced alignment using the specified parameters. It returns the path of the alignment output file.

5. The `time_to_seconds` function is a helper function that converts a time string in the format "HH:MM:SS" or "MM:SS" or "SS" to seconds.

6. The `parse_vad_file` function parses the Voice Activity Detection (VAD) output file and extracts the speech segments. It reads the file line by line, matches the start and end times using regular expressions, and returns a list of dictionaries containing the start and end times of each speech segment.

7. The `parse_text_timestamps_file` function parses the aligned text timestamps file and extracts the text segments along with their start and end times. It reads the file line by line, matches the start time, end time, and text using regular expressions, and returns a list of dictionaries containing the start time, end time, and text of each segment.

8. The `combine_segments` function combines the VAD segments with the text segments to create audio-text pairs. It iterates over each text segment, finds the overlapping VAD segments, and combines them into chunks of a specified maximum duration. It returns a list of dictionaries containing the chunk ID, chunk length, text, start time, and end time of each combined segment.

9. The `perform_vad` function performs Voice Activity Detection on the audio file using the `pyannote` library. It loads a pre-trained segmentation model, applies it to the audio file, and saves the VAD results to a file.

10. The `process_youtube_video` function is the main function that orchestrates the entire process. It takes a YouTube video URL as input, downloads the video, converts it to audio, performs VAD, transcribes the audio, aligns the transcript, parses the VAD and text segments, combines them, and returns the combined audio-text pairs as a JSON string.

11. Finally, the code creates a Gradio interface using the `gradio` library. It defines an interface with the `process_youtube_video` function as the main function, a text box for entering the YouTube video URL as input, and a JSON output for displaying the combined audio-text pairs. The interface is then launched using `iface.launch()`.

## Generalization
- The code provides a general approach for extracting audio-text pairs from YouTube videos. It can be applied to various types of videos, such as lectures, interviews, or presentations, where the goal is to align the spoken content with the corresponding text.

- The approach relies on several tools and models: FFmpeg for video-to-audio conversion, pytube for downloading YouTube videos, Whisper for audio transcription, ctc-forced-aligner for transcript alignment, and pyannote for Voice Activity Detection. The underlying speech models have been trained on diverse datasets and are generally applicable to a wide range of audio and video content.

- However, there are potential failure modes and limitations to consider:
  - The accuracy of the transcription and alignment may vary depending on the audio quality, background noise, accents, and speaking styles present in the video. Videos with poor audio quality, heavy accents, or overlapping speech may result in less accurate transcriptions and alignments.
  - The code assumes that the video contains spoken content in English. For videos in other languages, the Whisper model and the forced alignment library would need to be adapted or replaced with models trained on the target language.
  - The VAD model used in the code is pre-trained and may not be optimal for all types of audio. In some cases, custom VAD models trained on domain-specific data may yield better results.
  - The maximum chunk duration is set to a fixed value (15 seconds in the code). Depending on the nature of the content, this value may need to be adjusted to ensure semantically meaningful segments.

- To adapt the code for other languages, the following modifications would be required:
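  - Set the Whisper `language` option to the target language. Whisper large-v3 is multilingual, but the fine-tuned adapter used in this notebook was trained on Indian English, so it would need to be replaced or retrained.
  - Pass the matching language code to the forced aligner (this notebook uses `--language "eng"`), or use a different alignment tool if the target language is not supported.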
  - Update the regular expressions used for parsing the VAD and text timestamp files to match the format of the output generated by the language-specific tools.
  - Adjust any language-specific parameters or settings in the code, such as the VAD model or the text preprocessing steps.

Overall, the code provides a starting point for extracting audio-text pairs from YouTube videos, but it may require further customization and fine-tuning based on the specific characteristics of the videos being processed and the desired output quality.

In [None]:
!nvidia-smi

Wed May 29 19:01:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   62C    P0              30W /  72W |   2839MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
import os
import sys
import subprocess

In [None]:
def video2mp3(video_file, output_ext="wav"):
    filename, ext = os.path.splitext(video_file)
    subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.STDOUT)
    return f"{filename}.{output_ext}"
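
The conversion step can be made more explicit about the audio format. The sketch below is a hypothetical variant, not part of the original notebook: `build_ffmpeg_command` is an illustrative name, and the 16 kHz mono target is an assumption based on the fact that Whisper resamples its input to 16 kHz mono anyway. It only builds the argument list, so it can be inspected before anything is executed.

```python
import os

def build_ffmpeg_command(video_file, output_ext="wav", sample_rate=16000):
    """Return (command, output_file) for extracting mono audio (hypothetical helper)."""
    base, _ = os.path.splitext(video_file)
    output_file = f"{base}.{output_ext}"
    command = [
        "ffmpeg", "-y",            # overwrite the output file if it already exists
        "-i", video_file,          # input video
        "-vn",                     # drop the video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the requested rate
        output_file,
    ]
    return command, output_file

command, output_file = build_ffmpeg_command("talk.mp4")
print(" ".join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would raise an exception if FFmpeg fails, whereas `subprocess.call` with discarded output fails silently.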

In [None]:
!pip install pytube # installing library to download youtube video
!pip install -U openai-whisper
# Quote version specifiers: an unquoted ">" is treated by the shell as output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa soundfile
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install -q bitsandbytes datasets accelerate
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install torchaudio
!pip install torch

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0
Collecting openai-whisper
  Downloading openai-whisper-20231117.tar.gz (798 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper)
  Using cached nv

In [None]:
from pytube import YouTube

YouTube('https://youtu.be/Sby1uJ_NFIY').streams.first().download()

'/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.mp4'

In [None]:
input_video = '/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.mp4'

In [None]:
!sudo apt-get install ffmpeg

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [None]:
audio_file = video2mp3(input_video)

## Transcribing with the OpenAI Whisper large-v3 Model

In [None]:
import whisper

def translate(audio, output_file="transcription.txt"):
    model = whisper.load_model("large-v3")
    options = dict(beam_size=5, best_of=5)
    # task="translate" makes Whisper output English text regardless of the spoken language
    translate_options = dict(task="translate", **options)
    # Use the `audio` argument rather than the global `audio_file`
    result = model.transcribe(audio, **translate_options)
    # Save the result to a text file
    with open(output_file, "w") as file:
        file.write(result["text"])
    return result

In [None]:
result = translate(audio_file)
print(result["text"])

100%|██████████████████████████████████████| 2.88G/2.88G [00:24<00:00, 124MiB/s]


 Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? I am not hearing this at all. It's like a post lunch energy downer or something. Let's hear it. Are you guys awake? Alright. You better be because we have a superstar guest here. You heard the $41 million and I didn't hear honestly anything she said after that. So we are going to ask for about $40 million from him by the end of this conversation. But let's get started. I want to introduce Vivek and Pratyush, his co-founder who is not here. We wanted to start with playing a video of what OpenHearty does. I encourage all of you to go to the website serverum.ai and check it out. But first of all, I want to thank you all for joining us. Let me start by introducing Vivek. Vivek is a dear friend and he is very, very modest, one of the most modest guys that I know. But his personal journey, Vivek, you got a PhD from Carnegie Mellon, you started and sold a company to Magma. 

*The speakers in the video are Indian, so fine-tuning Whisper on Indian English accents can significantly improve transcription accuracy, especially for technical or domain-specific terms, as the errors in the transcript above show.*

*The transcript contains several incorrectly transcribed technical words and proper nouns. For example, "open hearty" should be "OpenHathi", "Hati" appears in many places where "Hathi" is meant, and "Ai for Bharat" should be "AI4Bharat".*

Fine-tuning Whisper on a dataset of Indian English accents helps the model adapt to the pronunciation patterns, intonation, and linguistic nuances of Indian speakers. Exposure to a diverse range of Indian English accents during training lets it better recognize the ways Indian speakers pronounce certain words, including technical terms and proper nouns.

Fine-tuning on domain-specific data, such as conversations or speeches related to technology, AI, or chemistry, can further improve the transcription of technical terms within those domains. A corpus of Indian English speech that includes technical jargon and domain-specific vocabulary improves recognition of such specialized terms. I therefore fine-tuned Whisper large-v3 on the Svarah dataset created by the AI4Bharat team.
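
One way to check whether fine-tuning or post-processing actually helps is to compare word error rate (WER) against a hand-corrected reference. The notebook installs `jiwer`, which is commonly used for this; the sketch below uses plain Python instead so that it stays self-contained. The reference sentence is a hypothetical hand-corrected line, and `word_error_rate` is an illustrative name, not a function from the notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, keeping one row at a time
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# Hypothetical hand-corrected reference vs. a mis-transcribed hypothesis
reference = "a video of what OpenHathi does"
hypothesis = "a video of what open hearty does"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

Here the single word "OpenHathi" becomes two wrong words, which counts as one substitution plus one insertion over six reference words.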

In [None]:
import torch
import gradio as gr
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    WhisperProcessor,
    BitsAndBytesConfig
)
from peft import PeftModel, PeftConfig


peft_model_id = "rs545837/finetuned_whisper"
peft_config = PeftConfig.from_pretrained(peft_model_id)
language = "English"
task = "transcribe"
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto"
)

model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

def transcribe(audio, output_file):
    with torch.cuda.amp.autocast():
        text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]

    # Save the transcription to a file
    with open(output_file, "w") as file:
        file.write(text)

    return text


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.




In [None]:
transcribe("/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.wav", "my_transcription.txt")



" Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? I am not hearing this at all. It's like a post lunch energy downer or something. Let's hear it. Are you guys awake? Alright. You better be because we have a superstar guest here. You heard the $41 million and I didn't hear honestly anything she said after that. So we're going to ask for about $40 million from him by the end of this conversation, okay? But let's get started. I want to introduce Vivek and Pratyush, his co-founder who's not here. We wanted to start with playing a video of what OpenHearty does. I encourage all of you to go to the website serverum.ai and check it out. but let me start by introducing Vivek Vivek is a dear friend and he is very very modest one of the most modest guys that I know but his personal journey Vivek you got a PhD from Carnegie Mellon, you started and sold a company to Magma and Vivek and I moved back to India we were both in the 

In [None]:
import re

# Function to format the transcript
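def format_transcript(transcript):
    # Ensure there is a space after every full stop.
    # Note: this also splits domain names, e.g. "serverum.ai" becomes "serverum. ai".
    transcript = re.sub(r'\.([A-Za-z])', r'. \1', transcript)

    # Replace specific words with the correct spelling
    replacements = {
        'open hathi': 'OpenHathi',
        'kruthrim': 'Krutrim',
        'open hearty': 'OpenHathi',
        'Aadhar': 'Aadhaar',
        'open AI': 'OpenAI',
        'open happy': 'OpenHathi',
        'sarom': 'Sarvam',
        'Sarwam': 'Sarvam',
        'Bhave Shakarwal': 'Bhavish Aggarwal',
        'Lama': 'Llama',
        'Bhavesh': 'Bhavish',
        'Bhavesh Agarwal': 'Bhavish Aggarwal'
    }

    # Apply longer phrases first so that 'Bhavesh Agarwal' is handled before
    # 'Bhavesh', and match whole words only so that 'Lama' does not match
    # inside an already correct 'Llama'.
    for old in sorted(replacements, key=len, reverse=True):
        pattern = r'\b' + re.escape(old) + r'\b'
        transcript = re.sub(pattern, replacements[old], transcript, flags=re.IGNORECASE)

    return transcript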

# Read the transcript from the file
with open('my_transcription.txt', 'r') as file:
    transcript = file.read()

# Format the transcript
formatted_transcript = format_transcript(transcript)

# Write the formatted transcript back to a file
with open('formatted_transcript.txt', 'w') as file:
    file.write(formatted_transcript)

print("Transcript formatted successfully.")

Transcript formatted successfully.


In [None]:
!pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
!pip install wheel torch

Collecting git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
  Cloning https://github.com/MahmoudAshraf97/ctc-forced-aligner.git to /tmp/pip-req-build-trzicv56
  Running command git clone --filter=blob:none --quiet https://github.com/MahmoudAshraf97/ctc-forced-aligner.git /tmp/pip-req-build-trzicv56
  Resolved https://github.com/MahmoudAshraf97/ctc-forced-aligner.git to commit 704fed8c8aecfa914b04d76be15aa2f5a0cf8103
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting Unidecode (from ctc-forced-aligner==0.2)
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Building wheels for collected packages: ctc-forced-aligner
  Building wheel for ctc-forced-aligner (pypro

In [None]:
# Align on a sentence level
!ctc-forced-aligner --audio_path "/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.wav" --text_path "/content/formatted_transcript.txt" --language "eng" --split_size "sentence" --romanize --window_size "15" --device "cuda"

2024-05-29 16:55:24.701732: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-29 16:55:24.701782: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-29 16:55:24.703630: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [None]:
!pip install pyannote.audio

Collecting pyannote.audio
  Downloading pyannote.audio-3.2.0-py2.py3-none-any.whl (873 kB)
Collecting asteroid-filterbanks>=0.4 (from pyannote.audio)
  Downloading asteroid_filterbanks-0.4.0-py3-none-any.whl (29 kB)
Collecting einops>=0.6.0 (from pyannote.audio)
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Collecting lightning>=2.0.1 (from pyannote.audio)
  Downloading lightning-2.2.5-py3-none-any.whl (2.0 MB)
Collecting omegaconf<3.0,>=2.1 (from pyannote.audio)
  Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)

In [None]:
from pyannote.audio import Model, Inference

model = Model.from_pretrained("pyannote/segmentation-3.0")

pytorch_model.bin:   0%|          | 0.00/5.91M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/399 [00:00<?, ?B/s]

In [None]:
from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
print(vad)
# Save VAD results to file
with open("vad.txt", 'w') as f:
    for segment, _, label in vad.itertracks(yield_label=True):
        if label == 'SPEECH':
            f.write(f"[{segment.start:.2f} --> {segment.end:.2f}] SPEECH\n")



print("VAD results saved to vad.txt")


[ 00:00:00.030 -->  00:00:04.350] 0 SPEECH
[ 00:00:08.687 -->  00:00:10.223] 0 SPEECH
[ 00:00:11.877 -->  00:00:17.091] 0 SPEECH
[ 00:00:18.222 -->  00:00:18.930] 0 SPEECH
[ 00:00:19.724 -->  00:00:20.736] 0 SPEECH
[ 00:00:21.900 -->  00:00:22.474] 0 SPEECH
[ 00:00:22.964 -->  00:00:24.685] 0 SPEECH
[ 00:00:25.157 -->  00:00:55.482] 0 SPEECH
[ 00:00:56.174 -->  00:00:59.937] 0 SPEECH
[ 00:01:00.291 -->  00:01:19.647] 0 SPEECH
[ 00:01:19.917 -->  00:01:42.023] 0 SPEECH
[ 00:01:42.242 -->  00:01:47.085] 0 SPEECH
[ 00:01:47.862 -->  00:02:20.211] 0 SPEECH
[ 00:02:20.312 -->  00:02:22.270] 0 SPEECH
[ 00:02:22.624 -->  00:02:31.669] 0 SPEECH
[ 00:02:32.597 -->  00:03:45.717] 0 SPEECH
[ 00:03:46.459 -->  00:04:25.052] 0 SPEECH
[ 00:04:25.862 -->  00:04:44.594] 0 SPEECH
[ 00:04:45.032 -->  00:04:46.416] 0 SPEECH
[ 00:04:48.914 -->  00:05:30.274] 0 SPEECH
[ 00:05:30.814 -->  00:05:31.709] 0 SPEECH
[ 00:05:32.249 -->  00:05:48.769] 0 SPEECH
[ 00:05:49.292 -->  00:08:41.991] 0 SPEECH
[ 00:08:42.
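
The two hyperparameters above are both set to `0.0`, so the raw regions are kept unchanged. To illustrate what non-zero values would do, here is a minimal plain-Python sketch that follows the descriptions in the code comments: gaps shorter than `min_duration_off` are filled, then regions shorter than `min_duration_on` are dropped. The thresholds of 0.5 s and 0.75 s are hypothetical, `smooth_speech_regions` is an illustrative name, and the library's own implementation may differ in its details.

```python
def smooth_speech_regions(regions, min_duration_on=0.0, min_duration_off=0.0):
    """Fill short gaps, then drop short speech regions.

    regions: list of (start, end) tuples in seconds.
    """
    merged = []
    for start, end in sorted(regions):
        # Fill the gap to the previous region if it is shorter than min_duration_off
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Remove speech regions shorter than min_duration_on
    return [(s, e) for s, e in merged if e - s >= min_duration_on]

# First seven regions from the VAD output above, in seconds
regions = [
    (0.030, 4.350), (8.687, 10.223), (11.877, 17.091), (18.222, 18.930),
    (19.724, 20.736), (21.900, 22.474), (22.964, 24.685),
]
smoothed = smooth_speech_regions(regions, min_duration_on=0.75, min_duration_off=0.5)
for start, end in smoothed:
    print(f"{start:.3f} --> {end:.3f}")
```

With these values, the 0.49 s gap before 22.964 s is filled and the 0.708 s region starting at 18.222 s is removed.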

In [None]:
import re
from datetime import timedelta

# Helper function to convert time strings to seconds
def time_to_seconds(time_str):
    parts = time_str.split(':')
    try:
        if len(parts) == 3:
            h, m, s = map(float, parts)
        elif len(parts) == 2:
            h = 0
            m, s = map(float, parts)
        elif len(parts) == 1:
            h = 0
            m = 0
            s = float(parts[0])
        else:
            raise ValueError(f"Invalid time format: {time_str}")
        return timedelta(hours=h, minutes=m, seconds=s).total_seconds()
    except ValueError:
        print(f"Invalid time format: {time_str}")  # Print the invalid time string
        raise

# Parsing functions
def parse_vad_file(vad_file):
    vad_segments = []
    with open(vad_file, "r") as f:
        for line in f:
            match = re.match(r'\[([\d:.]+) --> ([\d:.]+)\] SPEECH', line)
            if match:
                start_time = match.group(1)
                end_time = match.group(2)
                vad_segments.append({"start": start_time, "end": end_time})
    return vad_segments

def parse_text_timestamps_file(text_file):
    text_segments = []
    with open(text_file, "r") as f:
        for line in f:
            match = re.match(r'([\d:.]+)-([\d:.]+): (.+)', line)
            if match:
                start_time = match.group(1)
                end_time = match.group(2)
                text = match.group(3)
                text_segments.append({"start": start_time, "end": end_time, "text": text})
    return text_segments

# Parse the VAD and text segments
vad_segments = parse_vad_file("vad.txt")
text_segments = parse_text_timestamps_file("/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.txt")

print("VAD segments:", vad_segments)
print("Text segments:", text_segments)


VAD segments: [{'start': '0.03', 'end': '4.35'}, {'start': '8.69', 'end': '10.22'}, {'start': '11.88', 'end': '17.09'}, {'start': '18.22', 'end': '18.93'}, {'start': '19.72', 'end': '20.74'}, {'start': '21.90', 'end': '22.47'}, {'start': '22.96', 'end': '24.69'}, {'start': '25.16', 'end': '55.48'}, {'start': '56.17', 'end': '59.94'}, {'start': '60.29', 'end': '79.65'}, {'start': '79.92', 'end': '102.02'}, {'start': '102.24', 'end': '107.09'}, {'start': '107.86', 'end': '140.21'}, {'start': '140.31', 'end': '142.27'}, {'start': '142.62', 'end': '151.67'}, {'start': '152.60', 'end': '225.72'}, {'start': '226.46', 'end': '265.05'}, {'start': '265.86', 'end': '284.59'}, {'start': '285.03', 'end': '286.42'}, {'start': '288.91', 'end': '330.27'}, {'start': '330.81', 'end': '331.71'}, {'start': '332.25', 'end': '348.77'}, {'start': '349.29', 'end': '521.99'}, {'start': '522.21', 'end': '522.23'}, {'start': '522.31', 'end': '528.42'}, {'start': '530.26', 'end': '647.15'}, {'start': '648.03', '

In [None]:
def combine_segments(vad_segments, text_segments, max_duration=15):
    combined_pairs = []
    chunk_id = 1

    for text_segment in text_segments:
        text_start = time_to_seconds(text_segment["start"])
        text_end = time_to_seconds(text_segment["end"])

        segment_vads = []
        for vad_segment in vad_segments:
            vad_start = time_to_seconds(vad_segment["start"])
            vad_end = time_to_seconds(vad_segment["end"])

            # Check if VAD segment overlaps with the text segment
            if vad_start < text_end and vad_end > text_start:
                segment_vads.append(vad_segment)

        if not segment_vads:
            continue

        current_text = text_segment["text"]
        current_start_time = max(text_start, time_to_seconds(segment_vads[0]["start"]))
        current_end_time = min(text_end, time_to_seconds(segment_vads[-1]["end"]))
        current_duration = current_end_time - current_start_time

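        # Split segments longer than max_duration into fixed-length windows.
        # Note: the full text stays with the first window; later windows get empty text.
        while current_duration > max_duration: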
            split_end_time = current_start_time + max_duration
            combined_pairs.append({
                "chunk_id": chunk_id,
                "chunk_length": max_duration,
                "text": current_text.strip(),
                "start_time": current_start_time,
                "end_time": split_end_time
            })
            chunk_id += 1
            current_text = ""
            current_start_time = split_end_time
            current_duration = current_end_time - current_start_time

        combined_pairs.append({
            "chunk_id": chunk_id,
            "chunk_length": current_duration,
            "text": current_text.strip(),
            "start_time": current_start_time,
            "end_time": current_end_time
        })
        chunk_id += 1

    return combined_pairs

# Parse the VAD and text segments
vad_segments = parse_vad_file("/content/vad.txt")
text_segments = parse_text_timestamps_file("/content/Sarvam AI Wants To Leverage AI In Health & Education Says Co Founder Vivek Raghavan With OpenHathi.txt")

# Combine the segments
audio_text_pairs = combine_segments(vad_segments, text_segments)

# Print the audio-text pairs
for pair in audio_text_pairs:
    print(f"Chunk ID: {pair['chunk_id']}, Length: {pair['chunk_length']:.2f}s, Start: {pair['start_time']:.2f}s, End: {pair['end_time']:.2f}s, Text: {pair['text']}")

# Save the audio-text pairs to a file
with open("audio_text_pairs.txt", "w") as f:
    for pair in audio_text_pairs:
        f.write(f"Chunk ID: {pair['chunk_id']}, Length: {pair['chunk_length']:.2f}s, Start: {pair['start_time']:.2f}s, End: {pair['end_time']:.2f}s, Text: {pair['text']}\n")

print("Audio-text pairs saved to audio_text_pairs.txt")

Chunk ID: 1, Length: 1.22s, Start: 0.06s, End: 1.28s, Text: Congratulations to you Mr.
Chunk ID: 2, Length: 0.90s, Start: 1.28s, End: 2.18s, Text: Raghavan for that.
Chunk ID: 3, Length: 1.42s, Start: 2.18s, End: 3.60s, Text: Thank you so much for joining us.
Chunk ID: 4, Length: 0.75s, Start: 3.60s, End: 4.35s, Text: Over to you.
Chunk ID: 5, Length: 0.79s, Start: 8.69s, End: 9.48s, Text: Hi everybody.
Chunk ID: 6, Length: 0.74s, Start: 9.48s, End: 10.22s, Text: How are you?
Chunk ID: 7, Length: 1.88s, Start: 11.88s, End: 13.76s, Text: I am not hearing this at all.
Chunk ID: 8, Length: 3.33s, Start: 13.76s, End: 17.09s, Text: It's like a post lunch energy downer or something.
Chunk ID: 9, Length: 0.71s, Start: 18.22s, End: 18.93s, Text: Let's hear it.
Chunk ID: 10, Length: 1.02s, Start: 19.72s, End: 20.74s, Text: Are you guys awake?
Chunk ID: 11, Length: 0.57s, Start: 21.90s, End: 22.47s, Text: Alright.
Chunk ID: 12, Length: 4.70s, Start: 22.96s, End: 27.66s, Text: You better be becau
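
When a sentence is longer than `max_duration`, `combine_segments` cuts it into fixed windows but leaves the whole text on the first one. A possible refinement, sketched below, spreads the words across the windows in proportion to elapsed time. This assumes an even speaking rate, which is a strong assumption; word-level alignment would be more accurate. `split_long_segment` is an illustrative name and the 16-word sentence is placeholder data.

```python
def split_long_segment(text, start, end, max_duration=15.0):
    """Split one timed sentence into windows of at most max_duration seconds.

    Assumption: words are spoken at an even rate across the segment.
    """
    words = text.split()
    duration = end - start
    if duration <= max_duration or not words:
        return [{"text": text, "start_time": start, "end_time": end}]

    pieces = []
    window_start = start
    word_index = 0
    while window_start < end:
        window_end = min(window_start + max_duration, end)
        # Number of words assumed to have been spoken once window_end is reached
        next_index = round(len(words) * (window_end - start) / duration)
        pieces.append({
            "text": " ".join(words[word_index:next_index]),
            "start_time": window_start,
            "end_time": window_end,
        })
        word_index = next_index
        window_start = window_end
    return pieces

sentence = " ".join(f"w{i}" for i in range(16))  # placeholder 16-word sentence
pieces = split_long_segment(sentence, start=0.0, end=40.0)
for piece in pieces:
    print(piece)
```

Every window now carries text, so no chunk ends up with an empty transcript.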

In [None]:
import re

def parse_chunk_from_line(line):
    match = re.match(r'Chunk ID: (\d+), Length: ([\d.]+)s, Start: ([\d.]+)s, End: ([\d.]+)s, Text: (.+)', line)
    if match:
        chunk_id = int(match.group(1))
        chunk_length = float(match.group(2))
        start_time = float(match.group(3))
        end_time = float(match.group(4))
        text = match.group(5)
        return {
            "chunk_id": chunk_id,
            "chunk_length": chunk_length,
            "start_time": start_time,
            "end_time": end_time,
            "text": text
        }
    return None

def combine_chunks(chunks, max_duration=15):
    combined_chunks = []
    current_chunk = None

    for chunk in chunks:
        chunk_text = chunk["text"]
        chunk_start = chunk["start_time"]
        chunk_end = chunk["end_time"]
        chunk_duration = chunk_end - chunk_start

        if current_chunk is None:
            current_chunk = {
                "chunk_id": len(combined_chunks) + 1,
                "text": chunk_text,
                "start_time": chunk_start,
                "end_time": chunk_end,
                "chunk_length": chunk_duration
            }
        else:
            combined_text = current_chunk["text"] + " " + chunk_text
            combined_duration = chunk_end - current_chunk["start_time"]

            if combined_duration <= max_duration:
                current_chunk["text"] = combined_text
                current_chunk["end_time"] = chunk_end
                current_chunk["chunk_length"] = combined_duration
            else:
                combined_chunks.append(current_chunk)
                current_chunk = {
                    "chunk_id": len(combined_chunks) + 1,
                    "text": chunk_text,
                    "start_time": chunk_start,
                    "end_time": chunk_end,
                    "chunk_length": chunk_duration
                }

    if current_chunk is not None:
        combined_chunks.append(current_chunk)

    return combined_chunks

# Read chunks from the file
chunks = []
with open("audio_text_pairs.txt", "r") as file:
    for line in file:
        chunk = parse_chunk_from_line(line)
        if chunk is not None:
            chunks.append(chunk)

# Combine the chunks
combined_chunks = combine_chunks(chunks)

# Print and store the combined chunks
with open("combined_audio_text_pairs.txt", "w") as outfile:
    for chunk in combined_chunks:
        output_line = f"Chunk ID: {chunk['chunk_id']}, Length: {chunk['chunk_length']:.2f}s, Start: {chunk['start_time']:.2f}s, End: {chunk['end_time']:.2f}s, Text: {chunk['text']}\n"
        print(output_line, end="")
        outfile.write(output_line)

print("Combined chunks saved to combined_audio_text_pairs.txt")

Chunk ID: 1, Length: 13.70s, Start: 0.06s, End: 13.76s, Text: Congratulations to you Mr. Raghavan for that. Thank you so much for joining us. Over to you. Hi everybody. How are you? I am not hearing this at all.
Chunk ID: 2, Length: 13.90s, Start: 13.76s, End: 27.66s, Text: It's like a post lunch energy downer or something. Let's hear it. Are you guys awake? Alright. You better be because we have a superstar guest here.
Chunk ID: 3, Length: 13.58s, Start: 27.66s, End: 41.24s, Text: You heard the $41 million and I didn't hear honestly anything she said after that. So we're going to ask for about $40 million from him by the end of this conversation, okay? But let's get started.
Chunk ID: 4, Length: 14.24s, Start: 41.24s, End: 55.48s, Text: I want to introduce Vivek and Pratyush, his co-founder who's not here. We wanted to start with playing a video of what OpenHearty does. I encourage all of you to go to the website serverum. ai and check it out.
Chunk ID: 5, Length: 15.00s, Start: 56.17
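
The task statement asks for chunks shorter than 15 s, and the output above contains a chunk of exactly 15.00 s, so a quick sanity check on the combined chunks can be useful. The following is a minimal sketch, not part of the original pipeline: `find_violations` is a hypothetical helper, and it assumes each chunk is a dictionary with the `chunk_id`, `start_time` and `end_time` keys used above.

```python
def find_violations(chunks, max_duration=15.0):
    """Return (chunk_id, reason) tuples for chunks that break the constraints.

    Hypothetical helper: assumes each chunk is a dict with "chunk_id",
    "start_time" and "end_time" keys, listed in time order.
    """
    violations = []
    previous_end = None
    for chunk in chunks:
        duration = chunk["end_time"] - chunk["start_time"]
        if duration <= 0:
            violations.append((chunk["chunk_id"], "non-positive duration"))
        # The task statement asks for strictly less than 15 s.
        if duration >= max_duration:
            violations.append((chunk["chunk_id"], "too long"))
        if previous_end is not None and chunk["start_time"] < previous_end:
            violations.append((chunk["chunk_id"], "overlaps previous chunk"))
        previous_end = chunk["end_time"]
    return violations


# Toy data chosen to trigger two of the checks (not taken from the video).
sample = [
    {"chunk_id": 1, "start_time": 0.0, "end_time": 12.5},
    {"chunk_id": 2, "start_time": 12.5, "end_time": 27.5},  # exactly 15 s
    {"chunk_id": 3, "start_time": 27.0, "end_time": 30.0},  # starts before chunk 2 ends
]
print(find_violations(sample))
```

An empty list means every chunk is under the limit and the chunks do not overlap.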

In [None]:
import re
import json

def parse_chunk_from_line(line):
    match = re.match(r'Chunk ID: (\d+), Length: ([\d.]+)s, Start: ([\d.]+)s, End: ([\d.]+)s, Text: (.+)', line)
    if match:
        chunk_id = int(match.group(1))
        chunk_length = float(match.group(2))
        start_time = float(match.group(3))
        end_time = float(match.group(4))
        text = match.group(5)
        return {
            "chunk_id": chunk_id,
            "chunk_length": chunk_length,
            "start_time": start_time,
            "end_time": end_time,
            "text": text
        }
    return None

def combine_chunks(chunks, max_duration=15):
    combined_chunks = []
    current_combined = None

    for chunk in chunks:
        if current_combined is None:
            current_combined = chunk.copy()
        else:
            # Compare the merged span (first start to last end), not the sum of
            # chunk lengths, so silences between chunks count toward the limit.
            combined_duration = chunk['end_time'] - current_combined['start_time']
            if combined_duration <= max_duration:
                current_combined['text'] += " " + chunk['text']
                current_combined['end_time'] = chunk['end_time']
                current_combined['chunk_length'] = combined_duration
            else:
                combined_chunks.append(current_combined)
                current_combined = chunk.copy()

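    if current_combined is not None:
        combined_chunks.append(current_combined)

    # Renumber so chunk IDs are sequential in the combined output
    for new_id, combined in enumerate(combined_chunks, start=1):
        combined['chunk_id'] = new_id

    return combined_chunks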

# Read chunks from the file
chunks = []
input_file_path = "audio_text_pairs.txt"
with open(input_file_path, "r") as file:
    for line in file:
        chunk = parse_chunk_from_line(line)
        if chunk is not None:
            chunks.append(chunk)

# Combine the chunks
combined_chunks = combine_chunks(chunks)

# Save combined chunks to a JSON file
output_file_path = "combined_chunks.json"
with open(output_file_path, "w") as output_file:
    json.dump(combined_chunks, output_file, indent=4)

print(f"Combined chunks saved to {output_file_path}")

Combined chunks saved to combined_chunks.json
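
When chunks are merged, the length of the resulting audio is the span from the first start time to the last end time. That span includes any silence between chunks, so it can be larger than the sum of the individual chunk lengths. A small illustration using the timestamps of chunks 8 to 11 from the listing earlier in this notebook (`span_and_sum` is a hypothetical helper written for this example):

```python
def span_and_sum(chunks):
    """Return (span, summed_length) in seconds for chunks listed in time order."""
    span = chunks[-1]["end_time"] - chunks[0]["start_time"]
    summed = sum(c["end_time"] - c["start_time"] for c in chunks)
    return round(span, 2), round(summed, 2)


# Start and end times of chunks 8 to 11 from the listing above.
chunks_8_to_11 = [
    {"start_time": 13.76, "end_time": 17.09},
    {"start_time": 18.22, "end_time": 18.93},
    {"start_time": 19.72, "end_time": 20.74},
    {"start_time": 21.90, "end_time": 22.47},
]

span, summed = span_and_sum(chunks_8_to_11)
print(f"span: {span:.2f}s, sum of lengths: {summed:.2f}s, silence: {span - summed:.2f}s")
```

Here the four chunks add up to 5.63 s of speech but occupy 8.71 s of audio, so a duration limit should be checked against the span.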


In [None]:
import os
import re
import json
import subprocess
import whisper
import gradio as gr
from datetime import timedelta
from pytube import YouTube
from pyannote.audio.pipelines import VoiceActivityDetection
from pyannote.audio import Model

def video2mp3(video_file, output_ext="wav"):
    """Extract the audio track with FFmpeg. Despite the name, the default output is WAV."""
    filename, _ = os.path.splitext(video_file)
    audio_file = f"{filename}.{output_ext}"
    print(f"Converting {video_file} to {audio_file}...")
    # -y overwrites an existing output file without prompting
    returncode = subprocess.call(["ffmpeg", "-y", "-i", video_file, audio_file],
                                 stdout=subprocess.DEVNULL,
                                 stderr=subprocess.STDOUT)
    if returncode == 0 and os.path.exists(audio_file):
        print(f"Audio file {audio_file} created successfully.")
    else:
        print(f"Failed to create audio file {audio_file}.")
    return audio_file

def download_video(url):
    yt = YouTube(url)
    # Progressive streams bundle audio and video in a single file
    video = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first()
    if video is None:
        raise ValueError(f"No progressive MP4 stream found for {url}.")
    video_file = video.download()
    if os.path.exists(video_file):
        print(f"Video file {video_file} downloaded successfully.")
    else:
        print(f"Failed to download video from {url}.")
    return video_file

def transcribe(audio_file):
    print(f"Loading Whisper model and transcribing {audio_file}...")
    model = whisper.load_model("large-v3")
    result = model.transcribe(audio_file)
    print("Transcription completed.")
    return result["text"]

def align_transcript(audio_file, transcript):
    transcript_path = "temp_transcript.txt"

    with open(transcript_path, "w") as f:
        f.write(transcript)

    print("Aligning transcript with audio...")
    try:
        result = subprocess.run([
            "ctc-forced-aligner",
            "--audio_path", audio_file,
            "--text_path", transcript_path,
            "--language", "eng",
            "--split_size", "sentence",
            "--romanize",
            "--window_size", "15"
        ], capture_output=True, text=True)

        if result.returncode != 0:
            print("Error in forced alignment:")
            print(result.stderr)
            return None  # Indicate failure
        else:
            print("Alignment completed successfully.")
    except FileNotFoundError:
        print("ctc-forced-aligner not found. Please ensure it is installed and in your PATH.")
        return None  # Indicate failure

    output_path = f"{os.path.splitext(audio_file)[0]}.txt"
    if os.path.exists(output_path):
        print(f"Alignment file {output_path} created successfully.")
        return output_path  # Return the path to the output file
    else:
        print(f"Failed to create alignment file {output_path}.")
        return None  # Indicate failure

def time_to_seconds(time_str):
    parts = time_str.split(':')
    if len(parts) == 3:
        h, m, s = map(float, parts)
    elif len(parts) == 2:
        h = 0
        m, s = map(float, parts)
    elif len(parts) == 1:
        h = 0
        m = 0
        s = float(parts[0])
    else:
        raise ValueError(f"Invalid time format: {time_str}")
    return timedelta(hours=h, minutes=m, seconds=s).total_seconds()


# Parsing functions
def parse_vad_file(vad_file):
    vad_segments = []
    with open(vad_file, "r") as f:
        for line in f:
            match = re.match(r'\[([\d:.]+) --> ([\d:.]+)\] SPEECH', line)
            if match:
                start_time = match.group(1)
                end_time = match.group(2)
                vad_segments.append({"start": start_time, "end": end_time})
    return vad_segments

def parse_text_timestamps_file(text_file):
    text_segments = []
    try:
        with open(text_file, "r") as f:
            for line in f:
                match = re.match(r'([\d:.]+)-([\d:.]+): (.+)', line)
                if match:
                    start_time = match.group(1)
                    end_time = match.group(2)
                    text = match.group(3)
                    text_segments.append({"start": start_time, "end": end_time, "text": text})
    except FileNotFoundError:
        print(f"Text timestamps file '{text_file}' not found. Returning empty segments.")
    print(f"Parsed {len(text_segments)} text segments.")
    return text_segments

def combine_segments(vad_segments, text_segments, max_duration=15):
    combined_pairs = []
    chunk_id = 1

    print(f"Combining {len(vad_segments)} VAD segments with {len(text_segments)} text segments...")

    for text_segment in text_segments:
        text_start = time_to_seconds(text_segment["start"])
        text_end = time_to_seconds(text_segment["end"])

        segment_vads = []
        for vad_segment in vad_segments:
            vad_start = time_to_seconds(vad_segment["start"])
            vad_end = time_to_seconds(vad_segment["end"])

            if vad_start < text_end and vad_end > text_start:
                segment_vads.append(vad_segment)

        if not segment_vads:
            continue

        current_text = text_segment["text"]
        current_start_time = max(text_start, time_to_seconds(segment_vads[0]["start"]))
        current_end_time = min(text_end, time_to_seconds(segment_vads[-1]["end"]))
        current_duration = current_end_time - current_start_time

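        # Hard-split segments longer than max_duration at fixed intervals.
        # Limitation: the full text stays with the first piece, and the
        # remaining pieces carry empty text.
        while current_duration > max_duration: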
            split_end_time = current_start_time + max_duration
            combined_pairs.append({
                "chunk_id": chunk_id,
                "chunk_length": max_duration,
                "text": current_text.strip(),
                "start_time": current_start_time,
                "end_time": split_end_time
            })
            chunk_id += 1
            current_text = ""
            current_start_time = split_end_time
            current_duration = current_end_time - current_start_time

        combined_pairs.append({
            "chunk_id": chunk_id,
            "chunk_length": current_duration,
            "text": current_text.strip(),
            "start_time": current_start_time,
            "end_time": current_end_time
        })
        chunk_id += 1

    return combined_pairs

def perform_vad(audio_file):
    print("Performing Voice Activity Detection...")
    model = Model.from_pretrained("pyannote/segmentation-3.0")
    pipeline = VoiceActivityDetection(segmentation=model)
    HYPER_PARAMETERS = {
        # Remove speech regions shorter than this many seconds.
        "min_duration_on": 0.0,
        # Fill non-speech regions shorter than this many seconds.
        "min_duration_off": 0.0,
    }
    pipeline.instantiate(HYPER_PARAMETERS)
    vad = pipeline(audio_file)
    # `vad` is a pyannote.core.Annotation instance containing speech regions

    # Save VAD results to file
    with open("vad.txt", 'w') as f:
        for segment, _, label in vad.itertracks(yield_label=True):
            if label == 'SPEECH':
                f.write(f"[{segment.start:.2f} --> {segment.end:.2f}] SPEECH\n")

    print("VAD results saved to vad.txt")

def process_youtube_video(url):
    try:
        print("Downloading video...")
        video_file = download_video(url)
        print("Video downloaded successfully.")

        print("Converting video to audio...")
        audio_file = video2mp3(video_file)
        print("Video converted to audio successfully.")

        perform_vad(audio_file)

        print("Transcribing audio...")
        transcript = transcribe(audio_file)
        print("Audio transcribed successfully.")

        print("Aligning transcript...")
        alignment_output_path = align_transcript(audio_file, transcript)
        if alignment_output_path is None:
            print("Failed to align transcript. Exiting process.")
            return {"error": "Failed to align transcript."}
        print("Transcript aligned successfully.")

        vad_segments = parse_vad_file("vad.txt")
        print(f"VAD Segments: {vad_segments}")

        text_segments = parse_text_timestamps_file(alignment_output_path)
        print(f"Text Segments: {text_segments}")

        audio_text_pairs = combine_segments(vad_segments, text_segments)
        print(f"Combined Segments: {audio_text_pairs}")

        json_output = json.dumps(audio_text_pairs, indent=4)
        print("JSON output:", json_output)  # Print the JSON output for debugging
        return json_output
    except Exception as e:
        print("An error occurred:", str(e))  # Print any errors that occur
        return {"error": str(e)}

iface = gr.Interface(
    fn=process_youtube_video,
    inputs=gr.Textbox(label="YouTube Video URL"),
    outputs=gr.JSON(label="Combined Chunks JSON"),
    title="YouTube Video Processor",
    description="Enter a YouTube video URL and get the combined chunks as JSON."
)

iface.launch()

# To run the function directly without the Gradio interface:
# url = "https://www.youtube.com/shorts/HN0PZqL-CmE"
# process_youtube_video(url)


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://e5866e8b0e16006271.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
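
The pipeline above returns timestamps and text but does not write the audio clips themselves. One possible way to cut them is sketched below under stated assumptions: `build_cut_commands` and the `chunks/` output directory are hypothetical and not part of the original notebook, and the pairs are assumed to use the `chunk_id`, `start_time` and `end_time` keys produced by `combine_segments`. The sketch only builds FFmpeg argument lists (using `-ss` for the start offset and `-t` for the duration); it does not run them.

```python
def build_cut_commands(audio_file, pairs, out_dir="chunks"):
    """Build one FFmpeg argument list per audio-text pair.

    Hypothetical helper: nothing is executed here, the caller decides
    whether and how to run the returned commands.
    """
    commands = []
    for pair in pairs:
        duration = pair["end_time"] - pair["start_time"]
        out_path = f"{out_dir}/chunk_{pair['chunk_id']:04d}.wav"
        commands.append([
            "ffmpeg", "-y",                        # overwrite existing output
            "-i", audio_file,
            "-ss", f"{pair['start_time']:.2f}",    # start offset in seconds
            "-t", f"{duration:.2f}",               # clip duration in seconds
            out_path,
        ])
    return commands


# Timestamps of the first two combined chunks shown earlier in this notebook.
pairs = [
    {"chunk_id": 1, "start_time": 0.06, "end_time": 13.76},
    {"chunk_id": 2, "start_time": 13.76, "end_time": 27.66},
]
commands = build_cut_commands("talk.wav", pairs)
for command in commands:
    print(" ".join(command))
```

Each list could then be passed to `subprocess.run` once the output directory exists; `talk.wav` is a placeholder file name.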


