<a href="https://colab.research.google.com/github/xykkong/video-transcription-bert-faiss/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pt1. Preparation

## Installing Dependencies

In [19]:
!pip install SpeechRecognition
!pip install pydub
!pip install transformers
!pip install faiss-cpu
!pip install torch
!pip install pandas
!pip install pocketsphinx

Collecting pocketsphinx
  Downloading pocketsphinx-5.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.2 MB)
Collecting sounddevice (from pocketsphinx)
  Downloading sounddevice-0.4.6-py3-none-any.whl (31 kB)
Installing collected packages: sounddevice, pocketsphinx
Successfully installed pocketsphinx-5.0.3 sounddevice-0.4.6


## Importing libraries

In [21]:
import speech_recognition as sr
from pydub import AudioSegment
from transformers import BertTokenizer, BertModel
import faiss
import torch
import argparse
import os
import pandas as pd

# Pt2. Extract audio from video

In [22]:
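video_filename = "video1.mp4"
# Strip only the final extension so names containing extra dots still work
prefix = os.path.splitext(video_filename)[0]
audio_filename = f"{prefix}_audio.wav"
# pydub decodes the file through ffmpeg and loads its audio track
video = AudioSegment.from_file(video_filename, format="mp4")
# Convert to mono, 16 kHz, 16-bit samples before transcription
audio = video.set_channels(1).set_frame_rate(16000).set_sample_width(2)
audio.export(audio_filename, format="wav")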

<_io.BufferedRandom name='video1_audio.wav'>

# Pt3. Extract transcript from audio

In [23]:
# Initialize recognizer
recognizer = sr.Recognizer()

with sr.AudioFile(audio_filename) as source:
  # read the entire audio file
  audio_text = recognizer.record(source)

transcript = recognizer.recognize_sphinx(audio_text)

# Save the transcript
with open(f"{prefix}_transcript.txt", "w") as file:
  file.write(transcript)


# Pt4. Generate word embeddings

## Split transcription into a list of sentences

In [27]:
sentences = []
buffer = ''
with open(f"{prefix}_transcript.txt", 'r') as file:
  while True:
    chunk = file.read(1024)
    if not chunk:
      break
    text = (buffer + chunk).split('.')
    for sentence in text[:-1]:
      sentences.append(sentence.strip() + ".")
    buffer = text[-1]  # Incomplete sentence, stored for the next iteration

if buffer.strip():  # Keep any trailing text that did not end with a period
  sentences.append(buffer.strip() + ".")

print(len(sentences))
sentences

1


["mm unknown one of the one fascinated by the natural world has probably wondered why so that a socially and he says would use the elephant with its huge year and one yet available for flight which old clothes look like feathers been listening for this would haunt him what about was abusive husbands like the centuries all of crocodiles can surely these evolutionary tree toward just for decoration update in fact the somalis shoot the location of these animals and shortages coal for her husband just like goes i was real oiler sensitive to do with each other find their way around stay safe and most importantly kept falling from when you see how superior some animal senses are compared to those of humans you might wonder how we ever managed to stay on top of the food chain and also purchases you won't believe what's possible how broth and that's what individuals who can read and sending receiving information that is what what i was used on haitian to attract mates mortgage predators defend
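The transcript above contains no periods, so splitting on `.` returns the whole text as a single sentence. One possible workaround, sketched below, is to cut the text into fixed-size word windows instead. The helper name `chunk_words` and the `max_words` parameter are hypothetical and not part of the original notebook; the sample string is a stand-in for the real transcript.

```python
def chunk_words(text, max_words=30):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Hypothetical unpunctuated text standing in for the transcript
sample = "one two three four five six seven"
chunks = chunk_words(sample, max_words=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Fixed windows ignore sentence boundaries, so a chunk may start or end mid-thought; the window size is a trade-off between context per chunk and the number of vectors available to search.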

## Encoding sentences

In [28]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoding = tokenizer.batch_encode_plus(
    sentences,                 # List of input texts
    padding=True,              # Pad to the maximum sequence length
    truncation=True,           # Truncate to the maximum sequence length if necessary
    return_tensors='pt',       # Return PyTorch tensors
    add_special_tokens=True    # Add special tokens CLS and SEP
)
input_ids = encoding['input_ids']  # Token IDs
attention_mask = encoding['attention_mask']  # Attention mask

print(tokenizer.convert_ids_to_tokens(input_ids[0]))
print(f"Input ID: {input_ids}")
print(f"Attention mask: {attention_mask}")



['[CLS]', 'mm', 'unknown', 'one', 'of', 'the', 'one', 'fascinated', 'by', 'the', 'natural', 'world', 'has', 'probably', 'wondered', 'why', 'so', 'that', 'a', 'socially', 'and', 'he', 'says', 'would', 'use', 'the', 'elephant', 'with', 'its', 'huge', 'year', 'and', 'one', 'yet', 'available', 'for', 'flight', 'which', 'old', 'clothes', 'look', 'like', 'feathers', 'been', 'listening', 'for', 'this', 'would', 'haunt', 'him', 'what', 'about', 'was', 'abusive', 'husbands', 'like', 'the', 'centuries', 'all', 'of', 'crocodile', '##s', 'can', 'surely', 'these', 'evolutionary', 'tree', 'toward', 'just', 'for', 'decoration', 'update', 'in', 'fact', 'the', 'somali', '##s', 'shoot', 'the', 'location', 'of', 'these', 'animals', 'and', 'shortages', 'coal', 'for', 'her', 'husband', 'just', 'like', 'goes', 'i', 'was', 'real', 'oil', '##er', 'sensitive', 'to', 'do', 'with', 'each', 'other', 'find', 'their', 'way', 'around', 'stay', 'safe', 'and', 'most', 'importantly', 'kept', 'falling', 'from', 'when', 

## Generate word embeddings

In [29]:
# Generate embeddings using BERT model
with torch.no_grad():
  outputs = model(input_ids, attention_mask=attention_mask)
  # Mean-pool the token vectors into one embedding per sentence
  embeddings = outputs.last_hidden_state.mean(dim=1)
dimension = embeddings.shape[1]  # Hidden size of the model (768 for bert-base-uncased)
embeddings

tensor([[ 7.3845e-02,  2.4064e-01,  3.9455e-01, -7.7207e-02,  3.4091e-01,
         -2.9062e-01,  2.4080e-01,  5.8202e-01,  4.5681e-02, -3.8257e-01,
          4.3389e-01, -4.2421e-01, -8.9492e-02,  5.3538e-01, -7.6735e-01,
          2.5667e-01,  2.3315e-01,  4.6315e-02, -1.1845e-01,  3.3307e-01,
          4.8369e-01, -6.3948e-02,  1.7987e-03,  1.9575e-01,  4.4466e-01,
         -2.1665e-01, -6.8703e-02,  5.9327e-02, -1.2687e-01, -1.5661e-01,
          3.9326e-01,  1.8205e-01, -3.0477e-01, -1.1399e-01, -1.9866e-01,
         -1.5211e-01, -5.1954e-01, -7.6271e-03, -1.5939e-01,  3.2132e-01,
         -5.1335e-01, -3.3676e-01, -1.2710e-01,  1.2849e-01, -7.5629e-01,
         -1.8586e-01,  5.8936e-02, -6.8429e-02,  1.2338e-01,  5.6924e-03,
         -1.9549e-01, -5.7843e-03, -5.0394e-01, -1.7760e-01,  2.1725e-01,
          6.5395e-01,  1.0575e-01, -4.4441e-01, -6.7160e-01, -3.0942e-01,
          5.6059e-01,  1.3865e-01,  1.1479e-01, -3.2165e-01,  1.1511e-01,
         -3.5636e-02, -2.7275e-03,  7.
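The cell above averages over every token position. With a single sentence there is no padding, but once the batch holds sentences of different lengths, padded positions would be averaged in as well. A minimal sketch of mean pooling that skips padding follows; the function name `masked_mean` and the toy tensors are assumptions for illustration, not part of the original notebook.

```python
import torch

def masked_mean(last_hidden_state, attention_mask):
    """Average token vectors over real tokens only, skipping padding."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1), avoids divide-by-zero
    return summed / counts

# Toy tensors standing in for model output: the third position is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean(hidden, mask)
print(pooled)  # tensor([[2., 3.]])
```

In the notebook this would be called as `masked_mean(outputs.last_hidden_state, attention_mask)` in place of the plain `.mean(dim=1)`.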

# Pt5. Indexing in Faiss

In [31]:
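index = faiss.IndexFlatL2(dimension)  # Exact L2 index sized to the BERT embedding
vectors = embeddings.numpy()          # float32 NumPy view of the tensor (shares memory)
faiss.normalize_L2(vectors)           # Normalize each row to unit length, in place
index.add(vectors)
faiss.write_index(index, f"{prefix}.index")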
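The stored vectors are L2-normalized before indexing, so distances are only comparable when the query vector is normalized the same way. `IndexFlatL2` reports squared Euclidean distances, and for unit vectors that value equals `2 - 2 * cosine similarity`. The sketch below checks this relationship with plain NumPy rather than Faiss; the helper `l2_normalize` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (what faiss.normalize_L2 does in place)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

# Toy vectors standing in for the stored sentence embeddings and one query
stored = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0]], dtype="float32"))
query = l2_normalize(np.array([[6.0, 8.0]], dtype="float32"))

sq_dist = ((stored - query) ** 2).sum(axis=1)  # squared L2 distance per stored row
cosine = stored @ query[0]                     # cosine similarity per stored row

print(sq_dist)
print(2 - 2 * cosine)
```

Both printed arrays agree, and the first stored vector, which points in the same direction as the query, comes out nearest.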

# Pt6. Querying Faiss

In [32]:
query = "test"
k = 5

encoding = tokenizer.encode_plus(
    query,                     # Query text
    padding=True,              # Pad to the maximum sequence length
    truncation=True,           # Truncate to the maximum sequence length if necessary
    return_tensors='pt',       # Return PyTorch tensors
    add_special_tokens=True    # Add special tokens CLS and SEP
)
query_input_ids = encoding['input_ids']  # Token IDs
query_attention_mask = encoding['attention_mask']  # Attention mask

with torch.no_grad():
  outputs = model(query_input_ids, attention_mask=query_attention_mask)
  query_embeddings = outputs.last_hidden_state.mean(dim=1)

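# Note: the stored vectors were L2-normalized before indexing but this query is not,
# so the distance below mixes two scales; normalizing the query would make it comparable.
distances, ann = index.search(query_embeddings.numpy(), k)
print(distances)
# ann holds the indices of the nearest neighbours. IndexFlatL2 search is exact;
# -1 (with a maximum-float distance) marks slots left empty because the index has fewer than k vectors.
print(ann)
results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})
sentences[ann[0][0]]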
#merge = pd.merge(results, df, left_on='ann', right_index=True)
#results



[[6.9491692e+01 3.4028235e+38 3.4028235e+38 3.4028235e+38 3.4028235e+38]]
[[ 0 -1 -1 -1 -1]]


"mm unknown one of the one fascinated by the natural world has probably wondered why so that a socially and he says would use the elephant with its huge year and one yet available for flight which old clothes look like feathers been listening for this would haunt him what about was abusive husbands like the centuries all of crocodiles can surely these evolutionary tree toward just for decoration update in fact the somalis shoot the location of these animals and shortages coal for her husband just like goes i was real oiler sensitive to do with each other find their way around stay safe and most importantly kept falling from when you see how superior some animal senses are compared to those of humans you might wonder how we ever managed to stay on top of the food chain and also purchases you won't believe what's possible how broth and that's what individuals who can read and sending receiving information that is what what i was used on haitian to attract mates mortgage predators defend 