# Multimodal RAG

Tools used:
1. LlamaIndex framework
2. LanceDB
3. LLM (GPT-4V)
4. Gemini Pro Vision

Steps:
1. Download a video from YouTube, then extract its frames, audio, and transcript and store them.

2. Build a multimodal index and vector store for both text and images.

3. Retrieve the relevant images and text context, and use both to augment the prompt.

4. Use GPT-4V to reason about the correlations between the input query and the augmented data, and generate the final response.
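
Step 3 can be illustrated without any framework. The sketch below is only a minimal illustration, not the LlamaIndex implementation: `build_augmented_prompt` and its parameters are hypothetical names, and it assumes the retriever has already returned a list of text chunks and a list of image file paths. In a real multimodal pipeline the images would normally be passed to the model as image inputs rather than as paths inside the prompt text.

```python
def build_augmented_prompt(query, text_chunks, image_paths):
    """Combine the query with retrieved text and image references into one prompt.

    All names here are hypothetical; this only illustrates the "augment" step.
    """
    context = "\n".join(f"- {chunk.strip()}" for chunk in text_chunks)
    images = "\n".join(f"- {path}" for path in image_paths)
    return (
        "Answer the question using only the context and images provided.\n\n"
        f"Context:\n{context}\n\n"
        f"Images:\n{images}\n\n"
        f"Question: {query}\n"
    )


# Example inputs (made up for illustration; a retriever would supply these)
prompt = build_augmented_prompt(
    "What does the slope coefficient control?",
    ["The m or beta 1 coefficient controls the slope of the line."],
    ["mixed_data/frame_0003.png"],
)
print(prompt)
```

Keeping the retrieved context and the question in clearly separated sections makes it easier to inspect what the model was actually given when a response looks wrong.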

In [1]:
from moviepy.editor import VideoFileClip
from pathlib import Path
import speech_recognition as sr
from pytube import YouTube
from pprint import pprint

from PIL import Image
import matplotlib.pyplot as plt

import os
from dotenv import load_dotenv

In [2]:
# Load environment variables from .env file
load_dotenv()

# Retrieve the OpenAI API key (None if the variable is not set)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [3]:
os.getcwd()

'd:\\iNeuron\\GenAI\\GenerativeAI\\Rag_from_scratch'

In [4]:
# Define the base path where you want to store everything
base_path = Path(os.getcwd())

# URL of the source video
video_url = "https://youtu.be/3dhcmeOTZ_Q"

# Define the subdirectories you need to create
output_video_path = base_path / "video_data"
output_folder = base_path / "mixed_data"
output_audio_path = output_folder / "output_audio.wav"

# Create the directories if they don't already exist
output_video_path.mkdir(parents=True, exist_ok=True)
output_folder.mkdir(parents=True, exist_ok=True)

# Define the filepath for the input video
filepath = output_video_path / "input_vid.mp4"

# Print the paths to verify
print(f"Output Video Path: {output_video_path}")
print(f"Output Folder: {output_folder}")
print(f"Output Audio Path: {output_audio_path}")
print(f"Filepath: {filepath}")

Output Video Path: d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\video_data
Output Folder: d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\mixed_data
Output Audio Path: d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\mixed_data\output_audio.wav
Filepath: d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\video_data\input_vid.mp4


In [5]:
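def plot_images(image_paths):
    """Display up to four images from image_paths in a 2x2 grid."""
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(2, 2, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            # A 2x2 grid holds at most four images; a fifth subplot would raise ValueError
            if images_shown >= 4:
                break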

In [6]:
def download_video(video_url, output_path):
    """Download the video as input_vid.mp4 and return basic metadata."""
    yt = YouTube(video_url)
    metadata = {"Author": yt.author, "Title": yt.title, "Views": yt.views}
    # Save under a fixed filename so later cells can rely on `filepath`
    yt.streams.get_highest_resolution().download(
        output_path=str(output_path), filename="input_vid.mp4"
    )
    return metadata

In [7]:
def video_to_images(video_path, output_path):
    """Write one frame every 5 seconds (fps=0.2) to output_path as PNG files."""
    clip = VideoFileClip(str(video_path))  # Convert Path to string
    clip.write_images_sequence(
        os.path.join(str(output_path), "frame_%04d.png"), fps=0.2
    )
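
The `fps=0.2` setting means one frame is kept for every 5 seconds of video. The sketch below estimates which files that produces for a given duration. It is an approximation under stated assumptions: `expected_frames` is a hypothetical helper, and it assumes frames are sampled at t = 0, 1/fps, 2/fps, ... strictly below the clip duration and numbered from 0, which is how MoviePy 1.x appears to behave.

```python
import math


def expected_frames(duration_s, fps=0.2, name_format="frame_%04d.png"):
    """Return (filename, timestamp_in_seconds) pairs for the sampled frames.

    Assumes sampling at t = 0, 1/fps, 2/fps, ... below duration_s (hypothetical helper).
    """
    if duration_s <= 0:
        return []
    step = 1.0 / fps  # seconds between two saved frames
    count = math.ceil(duration_s / step)
    return [(name_format % i, i * step) for i in range(count)]


# A made-up 12-second clip, used only to illustrate the arithmetic
for name, t in expected_frames(12):
    print(name, t)
```

Knowing the timestamp behind each frame name is useful later, when a retrieved image needs to be matched with the part of the transcript spoken at that moment.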

In [8]:
def video_to_audio(video_path, output_audio_path):
    """Extract the audio track from the video and write it to output_audio_path."""
    clip = VideoFileClip(str(video_path))  # Convert Path to string
    audio = clip.audio
    audio.write_audiofile(str(output_audio_path))
    return audio

In [9]:
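def audio_to_text(audio_path):
    """Transcribe an audio file with Whisper; return an empty string on failure."""
    recogniser = sr.Recognizer()
    audio = sr.AudioFile(str(audio_path))
    text = ""  # Fallback so `text` is always defined, even if recognition fails

    with audio as source:
        audio_data = recogniser.record(source)

    try:
        # Recognise the speech with the Whisper model
        text = recogniser.recognize_whisper(audio_data)
    except sr.UnknownValueError:
        print("Speech Recognition could not understand audio")
    return text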

In [10]:
# import speech_recognition as sr
# from pathlib import Path

# def audio_to_text(audio_path, output_folder):
#     # Initialize the recognizer
#     recogniser = sr.Recognizer()
    
#     # Ensure the output folder exists
#     output_folder_path = Path(output_folder)
#     output_folder_path.mkdir(parents=True, exist_ok=True)
    
#     # Load the audio file
#     audio = sr.AudioFile(str(audio_path))
    
#     with audio as source:
#         # Adjust for ambient noise and record
#         recogniser.adjust_for_ambient_noise(source)
#         audio_data = recogniser.record(source)
        
#         try:
#             # Use Google's web speech API to recognize the speech
#             text = recogniser.recognize_google(audio_data)
#             print("Recognized text:", text)
            
#             # Write to file
#             output_file_path = output_folder_path / "recognized_text.txt"
#             with open(output_file_path, 'w') as file:
#                 file.write(text)
#             print(f"Text successfully written to {output_file_path}")
                
#         except sr.UnknownValueError:
#             print("Speech Recognition could not understand audio")
#             text = "Speech Recognition could not understand audio"
        
#         except sr.RequestError as e:
#             print(f"Could not request results from Google Speech Recognition service; {e}")
#             text = f"Error: {e}"
        
#         # Write any error messages or blank responses to the file as well
#         if text not in ["", "Speech Recognition could not understand audio"]:
#             with open(output_file_path, 'w') as file:
#                 file.write(text)
    
#     return text





In [11]:
video_url

'https://youtu.be/3dhcmeOTZ_Q'

In [12]:
output_video_path

WindowsPath('d:/iNeuron/GenAI/GenerativeAI/Rag_from_scratch/video_data')

In [13]:
metadata_vid = download_video(video_url, output_video_path)
metadata_vid

{'Author': '3-Minute Data Science',
 'Title': 'Linear Regression in 3 Minutes',
 'Views': 7829}

In [14]:
filepath

WindowsPath('d:/iNeuron/GenAI/GenerativeAI/Rag_from_scratch/video_data/input_vid.mp4')

In [15]:
output_folder

WindowsPath('d:/iNeuron/GenAI/GenerativeAI/Rag_from_scratch/mixed_data')

In [16]:
video_to_images(filepath,output_folder)

Moviepy - Writing frames d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\mixed_data\frame_%04d.png.
Moviepy - Done writing frames d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\mixed_data\frame_%04d.png.




In [17]:
video_to_audio(filepath,output_audio_path)

MoviePy - Writing audio in d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\mixed_data\output_audio.wav
MoviePy - Done.




<moviepy.audio.io.AudioFileClip.AudioFileClip at 0x186121acc10>

In [18]:
text_data=audio_to_text(output_audio_path)

In [19]:
text_data

" Lanyard regression is a statistical technique for modeling the relationship between an output variable and one or more input variables. In layman's terms, think of it as fitting a line through some data points as shown here, so you can make predictions on unknown data, assuming there is a linear relationship between the variables. You might be familiar with the linear function y equals mx plus b, where y is the output variable, also called the dependent variable. You may also see expressed as f of x, the function of the input variable. x on the other hand, would serve as the input variable, also called the independent variable. It's likely you'll see the coefficients m and b expressed as beta 1 and beta 0 respectively. So what do the m and b coefficients do? The m or beta 1 coefficient controls the slope of the line. The b or the beta 0 controls the intercept of the line. In machine learning, we also know it as the bias. These two coefficients are what we are solving for in linear re

In [20]:
# Write the transcript to a text file (explicit UTF-8 avoids platform-dependent encoding issues)
output_file_path = output_video_path / "recognized_text.txt"
with open(output_file_path, "w", encoding="utf-8") as file:
    file.write(text_data)
print(f"Text successfully written to {output_file_path}")

Text successfully written to d:\iNeuron\GenAI\GenerativeAI\Rag_from_scratch\video_data\recognized_text.txt
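
Before building the index, it can help to check what the cells above wrote to disk. The sketch below counts files by extension in a folder. `summarize_outputs` is a hypothetical helper, not part of any library used here, and the demo runs on a temporary directory with made-up file names so it does not depend on the downloaded video.

```python
from collections import Counter
from pathlib import Path
import tempfile


def summarize_outputs(folder):
    """Count the files directly inside `folder` by extension (hypothetical helper)."""
    return Counter(p.suffix.lower() for p in Path(folder).iterdir() if p.is_file())


# Demo on a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("frame_0000.png", "frame_0001.png", "output_audio.wav"):
        (Path(tmp) / name).touch()
    demo_counts = summarize_outputs(tmp)

print(dict(demo_counts))
```

In this notebook, `summarize_outputs(output_folder)` would report the extracted frames and the audio file. The transcript is saved under `video_data`, so it would appear in `summarize_outputs(output_video_path)` instead.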
