### Converting Audio to Subtitles for Enhanced Searchability

With the increasing volume of audio and video content, converting speech into searchable text is essential for **content accessibility, indexing, and search engine optimization**. Automatic speech recognition (ASR) enables the extraction of subtitles from audio files, making them easier to analyze, retrieve, and enhance with AI-powered search techniques.  

This notebook focuses on **automatically generating subtitles from audio files** using advanced machine learning models. The extracted subtitles can then be processed and indexed for better **search relevance, semantic retrieval, and AI-driven recommendations**.  

### Steps Involved:

1. **Install Required Libraries**  
   - We install the key dependencies: **Torch, pandas, ChromaDB, Sentence-Transformers, OpenAI-Whisper, SciPy, and SoundFile**. Whisper also relies on the **FFmpeg** command-line tool to decode audio.  

2. **Load and Process Audio Files**  
   - Audio files are preprocessed to ensure compatibility with speech recognition models.  

3. **Generate Subtitles Using OpenAI Whisper**  
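   - Whisper, an open-source ASR model from OpenAI, transcribes the audio into text. This notebook uses the `base` checkpoint, which favors speed over accuracy, so the transcript may contain occasional misheard words.  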

4. **Store and Optimize Subtitles**  
   - Extracted subtitles are saved in a structured format for further cleaning and indexing.  

By the end of this notebook, we will have a **fully functional subtitle generation pipeline**, ready for integration into search engines or AI-based retrieval systems.

In [1]:
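# Step 1: Install Required Libraries
# Note: the PyPI package `ffmpeg` is not the FFmpeg binary that Whisper invokes to decode audio;
# the binary itself must be available on the system PATH.

!pip install torch pandas chromadb sentence-transformers openai-whisper ffmpeg scipy soundfile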

Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting openai-whisper
  Downloading openai-whisper-20240930.tar.gz (800 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 800.5/800.5 kB 15.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-non
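
Whisper decodes audio by calling the `ffmpeg` command-line tool, so it can help to confirm the binary is present before transcribing. The check below is a minimal sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name, not part of Whisper, and the `apt-get` hint assumes a Debian/Ubuntu-based image such as Colab.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    # Assumption: on Debian/Ubuntu-based images the binary can usually be
    # installed from a notebook cell with:  !apt-get install -y ffmpeg
    print("ffmpeg binary not found; Whisper will not be able to load audio files.")
```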

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Step 2: Import Libraries

import torch
import whisper
import numpy as np
import pandas as pd
import chromadb
import scipy.spatial.distance as distance
from sentence_transformers import SentenceTransformer
import soundfile as sf


In [4]:
# Step 3: Load Whisper Model (base model for fast transcription)
whisper_model = whisper.load_model("base")


100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 91.5MiB/s]


In [5]:
# Step 4: Upload Audio File for Testing

from google.colab import files

uploaded = files.upload()
audio_file = list(uploaded.keys())[0]  # Get the uploaded file name
print(f"Uploaded File: {audio_file}")


Saving i_hear_voices.mp3 to i_hear_voices.mp3
Uploaded File: i_hear_voices.mp3


In [6]:
# Step 5: Transcribe the uploaded audio file
result = whisper_model.transcribe(audio_file)
query_text = result["text"]
print(f"Transcribed Query: {query_text}")


Transcribed Query:  I hear voices in my head, they count to me, they understand, they talk to me You got cheels in your religion, thought the sign to keep you safe But when we're stuck getting broken, you start questioning your fate I have a...
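
Besides `result["text"]`, the dictionary returned by `transcribe` also contains a `"segments"` list whose entries carry `start`, `end`, and `text` fields. One way to store the transcript in a structured subtitle format is to render those segments as SRT. The sketch below assumes that segment layout; `format_timestamp` and `segments_to_srt` are hypothetical helper names, and the example segments (including their timings) are made up for illustration.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments (dicts with start, end, text) as SRT text."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Joining with a newline leaves one blank line between subtitle blocks.
    return "\n".join(blocks)


# Hypothetical segments for illustration; in the notebook, pass result["segments"].
example_segments = [
    {"start": 0.0, "end": 3.5, "text": " I hear voices in my head"},
    {"start": 3.5, "end": 7.25, "text": " they talk to me"},
]
print(segments_to_srt(example_segments))
```

In the notebook this could be used as `open("i_hear_voices.srt", "w").write(segments_to_srt(result["segments"]))`, where the output file name is only an example.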


In [7]:
# Step 6: Load ChromaDB Collection

# Load ChromaDB Client
chroma_client = chromadb.PersistentClient(path="/content/drive/MyDrive/search_engine/db")  # Adjust the path if needed
collection = chroma_client.get_collection("subtitle_chunks")  # Load subtitles collection

# Load subtitle metadata
df = pd.read_parquet("/content/drive/MyDrive/search_engine/files/subtitles_extracted.parquet")  # Adjust path if needed

In [8]:
# Step 7: Generate Embedding for the Query

# Load Sentence Transformer Model
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# Generate embedding for query text
query_embedding = embedder.encode([query_text])[0]


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
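
The `all-MiniLM-L6-v2` model truncates input beyond 256 word pieces by default, so a long transcript passed as a single string is only partly reflected in the embedding. One possible workaround, not used in this notebook, is to split the transcript into overlapping word windows, embed each window, and average the vectors. The sketch below assumes that approach; `split_into_windows` and `mean_embedding` are hypothetical helpers, and the window size of 100 words is an arbitrary choice intended to stay under the limit.

```python
from typing import Callable, List


def split_into_windows(text: str, size: int = 100, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def mean_embedding(text: str, encode: Callable, size: int = 100, overlap: int = 20) -> List[float]:
    """Embed each window with `encode` and average the vectors component-wise."""
    windows = split_into_windows(text, size, overlap)
    if not windows:
        raise ValueError("text contains no words to embed")
    vectors = [list(v) for v in encode(windows)]
    dim = len(vectors[0])
    # Cast to plain Python floats so the result is a simple list of numbers.
    return [float(sum(v[i] for v in vectors) / len(vectors)) for i in range(dim)]
```

With the objects defined in this notebook, the call would look like `query_embedding = mean_embedding(query_text, embedder.encode)`. The result is already a plain list, so it would be passed to ChromaDB without `.tolist()`.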


### Retrieving and Displaying Relevant Subtitles

With the subtitles already processed, chunked, and indexed, we can use ChromaDB to retrieve the subtitle segments most relevant to a query. Because matching is done on embeddings rather than exact keywords, a query can match subtitle content that is similar in meaning even when the wording differs.

### How the Retrieval Works:

1. **Generate Query Embeddings**  
   - The search query (here, the Whisper transcript of the uploaded audio) is converted into an embedding, a vector representation that enables similarity matching.  

2. **Search ChromaDB for Similar Subtitles**  
   - We query the ChromaDB collection to retrieve the top 5 most relevant subtitle segments using vector similarity search.  
   - The results include metadata such as `subtitle_name`, `original_index`, and `subtitle_id` to provide context.  

3. **Display Search Results**  
   - The retrieved results are formatted and displayed with essential details:  
     - **Subtitle Name** – The episode or movie name.  
     - **Subtitle Index** – The index reference for positioning.  
     - **Link to Subtitle** – A direct link for further exploration.  
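
To make "vector similarity" concrete, the sketch below computes cosine similarity and squared L2 distance in plain Python. ChromaDB collections use squared L2 distance unless a different space was configured when the collection was created; whether `subtitle_chunks` uses that default is an assumption here. For unit-length vectors the two measures rank results identically, because squared L2 distance equals `2 - 2 * cosine_similarity`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors (0.0 means identical)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two small illustrative vectors standing in for a query and a subtitle chunk.
query_vec = normalize([3.0, 4.0])
chunk_vec = normalize([4.0, 3.0])

cos = cosine_similarity(query_vec, chunk_vec)
dist = squared_l2(query_vec, chunk_vec)
print(f"cosine similarity = {cos:.2f}, squared L2 = {dist:.2f}")
```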

In [9]:
# Step 8: Retrieve the top 5 most similar results from ChromaDB
results = collection.query(
    query_embeddings=[query_embedding.tolist()],  # Query embedding as a plain list
    n_results=5,  # Top 5 results
    include=["metadatas"]  # Return only the metadata of each match
)

# Display the top 5 results
for i, metadata in enumerate(results["metadatas"][0]):
    subtitle_name = metadata.get("subtitle_name", "Unknown Episode")  # Episode or movie name
    subtitle_index = metadata.get("original_index", "Unknown Index")  # Index reference for positioning
    subtitle_id = metadata.get("subtitle_id", "Unknown ID")  # OpenSubtitles ID used to build the link

    # Print the result details
    print(f"{i + 1}. 🎥 **{subtitle_name}**")
    print(f"📌 {subtitle_index}")
    print(f"🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/{subtitle_id})")
    print("-" * 50)


1. 🎥 **black.adam.(2022).eng.1cd**
📌 17306
🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/9322190)
--------------------------------------------------
2. 🎥 **brainwashed.sexcamerapower.(2022).eng.1cd**
📌 10504
🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/9420995)
--------------------------------------------------
3. 🎥 **american.experience.s31.e09.woodstock.three.days.that.defined.a.generation.(2019).eng.1cd**
📌 23476
🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/9222769)
--------------------------------------------------
4. 🎥 **tatarin.(2001).eng.1cd**
📌 13325
🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/9287461)
--------------------------------------------------
5. 🎥 **blood.treasure.s02.e13.showdown.in.hong.kong.(2022).eng.1cd**
📌 12293
🔗 [Link to Subtitle](https://www.opensubtitles.org/en/subtitles/9260166)
--------------------------------------------------
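
The output above lists matches without showing how close each one is to the query. If the query is run with `include=["metadatas", "distances"]`, the distances can be shown next to each result, which helps judge whether the top hits are genuinely similar. The helper below is a minimal sketch that flattens a single-query result into rows; `results_to_rows` is a hypothetical name, and the distance value in the example is invented for illustration.

```python
def results_to_rows(results):
    """Flatten a single-query ChromaDB result dict into row dicts, best match first."""
    metadatas = results["metadatas"][0]
    distances = results.get("distances")
    # Distances are absent (None) when they were not requested via `include`.
    distances = distances[0] if distances else [None] * len(metadatas)
    rows = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        rows.append({
            "rank": rank,
            "subtitle_name": meta.get("subtitle_name", "Unknown Episode"),
            "original_index": meta.get("original_index"),
            "subtitle_id": meta.get("subtitle_id"),
            "distance": dist,
        })
    return rows


# Example shaped like a ChromaDB result; the distance is a made-up value.
example_results = {
    "metadatas": [[{"subtitle_name": "black.adam.(2022).eng.1cd",
                    "original_index": 17306,
                    "subtitle_id": 9322190}]],
    "distances": [[1.23]],
}
for row in results_to_rows(example_results):
    print(row)
```

The rows can also be loaded into a table with `pd.DataFrame(results_to_rows(results))` for sorting or filtering.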


### Conclusion: Enhancing Search Engine for Video Subtitles

By combining subtitle extraction, cleaning, chunking, and semantic search in one pipeline, we make video subtitles searchable by meaning rather than by exact wording. This approach enables:

- **Better Content Discovery** – Users can find specific subtitle moments based on contextual meaning.

- **Improved Search Accuracy** – Embeddings match queries to subtitles by semantic similarity instead of exact keywords.

- **Efficient Storage & Retrieval** – Parquet storage for subtitle metadata and ChromaDB vector indexing support search over large subtitle collections.

This marks the completion of our video subtitle enhancement project, which provides an embedding-based search solution for improved video accessibility and information retrieval.