# Description:
This notebook analyzes video with Generative AI. It first extracts the audio track from a video, transcribes it to text, and then queries or summarizes the transcript with pre-trained models. The video is also split into frames at a configurable sampling interval. A visual question-answering (VQA) model is run on each frame, duplicate answers are removed, and a large language model (LLM) turns the remaining answers into a single response. Together, these steps let you ask questions about both the spoken and the visual content of a video.

## Install necessary libraries
These instructions are for setting up a Python environment with various libraries and tools, mainly for handling video processing, machine learning, and serving AI models. Here's a breakdown:

- **opencv-python**: Installs OpenCV, a library for computer vision tasks.
- **moviepy**: Installs MoviePy, a library for video editing.
- **ffmpeg**: Installs FFmpeg, a tool for handling multimedia files. Unlike the other entries, it is installed with `apt` as a system package rather than with `pip`.
- **transformers**: Installs Hugging Face's Transformers library for working with transformer-based models.
- **torch, torchvision, torchaudio**: Installs PyTorch and related libraries for machine learning tasks with images (torchvision) and audio (torchaudio).
- **accelerate**: Installs Accelerate, a library for optimizing and distributing model training.
- **pillow, flask**: Installs Pillow (image handling) and Flask (for creating web applications).
- **txtai**: Installs the txtai library (with API and pipeline options), which enables AI-powered search and embeddings.

These are used to set up an environment for video, image, and text processing with machine learning capabilities.

In [None]:
!pip install opencv-python
!pip3 install moviepy
!sudo apt install ffmpeg -y
!pip3 install transformers
!pip3 install torch torchvision torchaudio
!pip3 install accelerate
!pip3 install pillow flask
!pip install "git+https://github.com/neuml/txtai#egg=txtai[api,pipeline]"

## Hugging Face Login
This code imports the `notebook_login` function from the Hugging Face Hub and calls it to prompt the user to enter their Hugging Face token. This token is necessary to authenticate the user and gain access to Hugging Face's resources, such as models and datasets, from within a Jupyter Notebook or similar environment.

In [None]:
from huggingface_hub import notebook_login

# This will prompt you to enter your Hugging Face token
notebook_login()

## TXTAI directory setup
This code sets up a directory named `txtai` by first ensuring that any pre-existing directory with that name is removed. It starts by defining the path to the directory as `"txtai"`, and then runs a shell command to check if the directory already exists. If it does, it deletes the directory along with all of its contents using the `sudo rm -rf` command. After clearing any previous data, the code creates a new directory named `txtai` using the `sudo mkdir -p` command, which ensures the directory is created without any errors, even if it already exists. This guarantees that a fresh, empty directory is available for use.

In [None]:
# Define the directory path
directory_path = "txtai"

# Execute the shell commands
!if [ -d "{directory_path}" ]; then sudo rm -rf "{directory_path}"; fi
!sudo mkdir -p "{directory_path}"
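
The same delete-then-recreate pattern can be written in plain Python, which avoids shell commands and `sudo`. The following is a minimal sketch, not the notebook's own method: the helper name `reset_directory` is hypothetical, and it assumes the notebook user has write permission on the parent directory.

```python
import shutil
from pathlib import Path


def reset_directory(path):
    """Delete `path` if it exists, then recreate it as an empty directory."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)  # remove the directory and everything in it
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    return target
```

Calling `reset_directory("txtai")` would stand in for the two shell lines above. A directory created this way is owned by the notebook user rather than by root, so later steps could write into it without `sudo`.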

## TXTAI Pipeline
This cell imports the `Transcription` class from the `txtai.pipeline` module and creates an instance named `transcribe`. Because no model name is passed, the instance uses the library's default speech-recognition model. It accepts audio files and returns the spoken content as text.

In [None]:
from txtai.pipeline import Transcription

# Create transcription model
transcribe = Transcription()


## Generate audio from video
This cell uses the `moviepy.editor` module, imported as `mp`, to extract the audio track from a video. It loads `your_video.mp4` into a `VideoFileClip` object, writes the clip's audio track to `your_video.wav`, and then copies that file into the `txtai` directory with `sudo cp`. The original `.wav` file stays in the working directory.

**Replace all references of `your_video` or `your_video.mp4` or `your_video.wav` with the actual name of your video file or video clip.**

In [None]:
import moviepy.editor as mp

# Load the video file
video = mp.VideoFileClip("your_video.mp4")

# Extract audio from the video
audio_path = "your_video.wav"
video.audio.write_audiofile(audio_path)

!sudo cp your_video.wav ./txtai
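
Because the file name appears in several places, one option is to derive the audio path from the video path so that only a single variable has to change. This is a sketch under stated assumptions: `audio_path_for` is a hypothetical helper, and `"txtai"` is assumed to be the target directory used above.

```python
from pathlib import Path


def audio_path_for(video_path, out_dir="txtai"):
    """Return the .wav path inside `out_dir` that corresponds to `video_path`."""
    video = Path(video_path)
    return Path(out_dir) / video.with_suffix(".wav").name


video_path = "your_video.mp4"  # the only name that needs to change
audio_path = audio_path_for(video_path)
print(audio_path)
```

Passing `str(audio_path)` to `video.audio.write_audiofile(...)` would then write the audio straight into the target directory, provided that directory is writable by the notebook user, and the separate copy step would no longer be needed.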

## List of audio files to process
This cell builds the list of audio files to transcribe. It imports `Audio` and `display` from `IPython.display`, which can play audio inline in a Jupyter notebook, although this cell does not call them. It then creates a list named `files` containing `your_video.wav` and prepends `txtai/` to each entry, giving paths such as `txtai/your_video.wav`. To process more recordings, add their file names to the list.

In [None]:
from IPython.display import Audio, display

files = ["your_video.wav"] # Add more file names here to process multiple files
files = ["txtai/%s" % x for x in files]

## Transcription


This cell recreates `transcribe` with an explicit model, `"openai/whisper-base"`, a Whisper checkpoint for automatic speech recognition (ASR). Calling `transcribe(files)` converts each audio file in `files` (e.g., `"txtai/your_video.wav"`) to text, and the loop prints each transcript. After the loop finishes, `text` holds the transcript of the last file in the list, which is the value the question-answering and summarization cells use later.

In [None]:
# Transcribe files
transcribe = Transcription("openai/whisper-base")
for text in transcribe(files):
  print(text)
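
When `files` contains more than one entry, relying on the loop variable keeps only the last transcript. A minimal sketch of one alternative is shown below. It assumes the transcription callable returns one transcript per input path, in input order; `collect_transcripts` and `fake_transcribe` are hypothetical names, and the stand-in function replaces the real model so the example runs without downloads.

```python
def collect_transcripts(files, transcribe_fn):
    """Map each audio path to its transcript, preserving input order."""
    return dict(zip(files, transcribe_fn(files)))


# Stand-in for the txtai pipeline so the sketch runs without model downloads.
def fake_transcribe(paths):
    return [f"transcript of {p}" for p in paths]


transcripts = collect_transcripts(["txtai/a.wav", "txtai/b.wav"], fake_transcribe)
text = " ".join(transcripts.values())  # one combined context for later cells
print(text)
```

In the notebook, `transcribe_fn` would be the `transcribe` instance created above.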

## Ingestion directory setup
This cell prepares the directory used for ingesting documents into a vector database. It defines the path `/home/ubuntu/documents`, deletes the directory and its contents with `sudo rm -rf` if it already exists, and then recreates it with `sudo mkdir -p`, leaving a clean, empty directory.

In [None]:
# Define the directory path for ingestion of documents into vector db
ingestion_path = "/home/ubuntu/documents"

# Execute the shell commands
!if [ -d "{ingestion_path}" ]; then sudo rm -rf "{ingestion_path}"; fi
!sudo mkdir -p "{ingestion_path}"

## Chat with transcribed text using Gen AI LLM
This cell uses a pre-trained extractive question-answering model from the `transformers` library to answer questions about the video. The `ask_question` function passes a question to `qa_pipeline` together with the transcript as context, and the pipeline returns the span of the transcript that best answers it. An example question, "What happened to the 17 year old boy?", is passed to the function, and the answer is printed along with the time the call took.

In [None]:
from transformers import pipeline
import time

# Load a question-answering pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", device=-1)

# Function to ask questions about the video
def ask_question(question):
    context = text
    # max_answer_len caps the length of the extracted answer span
    result = qa_pipeline(question=question, context=context, max_answer_len=100)
    return result

# Example usage
question = "What happened to the 17 year old boy?"
t1 = time.time()
answer = ask_question(question)
t2 = time.time()
t3 = t2 - t1

print("QnA from transcribed text takes: ", t3, " seconds")
print(f"Question: {question}\nAnswer: {answer['answer']}")
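
The `t1`/`t2`/`t3` timing pattern recurs in several cells. A small context manager is one way to factor it out. This is only a sketch: the name `timed` and the optional `results` dictionary are assumptions introduced for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label} took {elapsed:.3f} seconds")


timings = {}
with timed("sleep", timings):
    time.sleep(0.01)
```

With this helper, the timed call above could be written as `with timed("QnA from transcribed text"): answer = ask_question(question)`.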

## Summarize transcribed text using Gen AI LLM
This cell uses the `transformers` summarization pipeline with the pre-trained `"facebook/bart-large-cnn"` model to summarize the transcript. `max_length` and `min_length` bound the length of the summary in tokens, while `do_sample`, `top_k`, and `top_p` control how each next token is sampled, trading variety against focus. The cell times the call and prints both the summary and the elapsed time.

In [None]:
from transformers import pipeline
import time

# Load a summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


document = text

# Generate the summary
# do_sample: When True, the model samples each next token from the probability distribution, which can produce more varied summaries.
# When False, decoding is deterministic (greedy or beam search, depending on the model's generation settings).
# top_k: Restricts sampling to the k most probable next tokens. A smaller top_k makes the output more focused.
# top_p: Nucleus sampling. Restricts sampling to the smallest set of tokens whose cumulative probability reaches top_p. A lower top_p makes the output more conservative.
t1 = time.time()
summary = summarizer(document, max_length=150, min_length=50, do_sample=True, top_k=50, top_p=0.9)
t2 = time.time()
t3 = t2 - t1

# Print the summary
print(summary[0]['summary_text'])
print("Summarization took in seconds: ", t3)
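
`facebook/bart-large-cnn` accepts at most 1024 input tokens, so a long transcript may not fit in a single call. One common workaround is to summarize the text in chunks and join the partial summaries. The sketch below illustrates that idea under stated assumptions: word count is only a rough proxy for token count, the 400-word default is an arbitrary choice, and `split_into_chunks`, `summarize_long_text`, and the stand-in `first_three_words` are hypothetical names.

```python
def split_into_chunks(text, max_words=400):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with `summarize_fn` and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in split_into_chunks(text, max_words))


# Stand-in summarizer (keeps the first three words) so the sketch runs offline.
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])


demo = "one two three four five six seven eight nine ten"
print(split_into_chunks(demo, max_words=4))
print(summarize_long_text(demo, first_three_words, max_words=4))
```

In the notebook, `summarize_fn` could wrap the pipeline, for example `lambda chunk: summarizer(chunk, max_length=150, min_length=50)[0]['summary_text']`.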

## Create frames from the video
This code extracts frames from a video at a rate of 1 frame per second and saves each frame as a PNG image. The process begins by capturing the current time, then loops through the video duration, extracting frames at each second and saving them with filenames like frame_0.png, frame_1.png, etc. After the extraction, it calculates and prints the total time taken for the process. This allows for easy extraction and saving of video frames for further processing or analysis.

In [None]:
# Extract one frame every `frame_interval` seconds and save each as a PNG image
import time
t1 = time.time()
frame_interval = 1  # seconds between saved frames (1 = one frame per second)
for t in range(0, int(video.duration), frame_interval):
    frame = video.get_frame(t)  # frame at time t (in seconds) as a NumPy array
    frame_image = mp.ImageClip(frame)
    frame_image.save_frame(f"frame_{t}.png")

t2 = time.time()
t3 = t2 - t1
print("total time: ", t3)
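
Because `range` only accepts integer steps, the loop above cannot sample more than one frame per second. A sketch of one way to support arbitrary rates is to compute the sample times first. The helper name `frame_timestamps` and its `fps` parameter are assumptions for illustration, not part of the notebook.

```python
def frame_timestamps(duration, fps=1.0):
    """Return the times (in seconds) at which to sample frames from a clip."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    count = int(duration * fps)  # number of samples that fit in the clip
    return [i / fps for i in range(count)]


print(frame_timestamps(3.5, fps=1))     # one frame per second
print(frame_timestamps(2.0, fps=2))     # two frames per second
print(frame_timestamps(10.0, fps=0.5))  # one frame every two seconds
```

Each returned time could then be passed to `video.get_frame(t)`, with the list index used in the file name so that fractional times do not end up in it.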

## Copy Frames in a directory
This code manages the setup of a directory for storing extracted video frames. It defines the directory path as `"frames"` and checks if this directory already exists. If it does, the directory and its contents are removed. After that, a new `frames` directory is created using the `mkdir -p` command. Finally, all files matching the pattern `frame*.*` (such as extracted frame images) are copied into the `frames` directory for organized storage.

In [None]:
# Define the directory path
directory_path = "frames"

# Execute the shell commands
!if [ -d "{directory_path}" ]; then sudo rm -rf "{directory_path}"; fi
!sudo mkdir -p "{directory_path}"
!sudo cp frame*.* "{directory_path}"

## Query the images from the video and generate response
### Step-by-step explanation

1. **Check for CUDA Availability**:
   - The code first checks whether **CUDA (GPU)** is available. If it is, the models run on the GPU; otherwise, they fall back to the **CPU**.

2. **Initialize VQA Model and Processor**:
   - It sets up a **Visual Question Answering (VQA)** model and processor from Hugging Face, specifically the **"dandelin/vilt-b32-finetuned-vqa"** model. The model is designed to process both text (questions) and images (frames) to answer visual questions.

3. **Initialize GPT-Neo Model and Tokenizer**:
   - The code also initializes the **GPT-Neo language model**, which is a powerful text-generation model. The tokenizer is used to convert text inputs into a format that the model can process. This model is later used to refine the answers produced by the VQA model into more fluent and coherent responses.

4. **Load Frames**:
   - It defines a function to load **image frames** (in PNG format) from a specified directory. The frames are stored in memory and used for querying. Each frame represents a snapshot from the video, and these are processed later to extract relevant information.

5. **Query Frames**:
   - For each frame, the **VQA model** is used to answer a given question. The model processes both the text of the question and the visual content of each frame to generate a relevant answer for each frame. These answers are collected for further processing.

6. **Combine Answers**:
   - Once the answers are collected from the frames, they are combined into a **single coherent response**. The function handles specific types of questions, such as those asking for counts ("How many"), and produces a logical summary of the answers provided by the VQA model.

7. **Refine Response with GPT-Neo**:
   - After combining the answers, the **GPT-Neo model** refines the response to make it more natural and coherent. The GPT-Neo model takes the raw combined response and generates a well-structured, fluent answer in plain English.

8. **Timing and Execution**:
   - The code also records how long each major step (querying frames, combining answers, and refining the response) takes. These wall-clock timings show where the pipeline spends most of its time, which helps when optimizing it.

9. **Final Response**:
   - After all the steps are completed, the final, refined response is printed. This answer is the system's conclusion, based on the question and the information extracted from the video frames, providing a seamless interaction between visual and textual data processing.

In [None]:
import os
import time
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering, GPTNeoForCausalLM, GPT2Tokenizer
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU for computation")
else:
    device = torch.device("cpu")
    print("CUDA is not available, using CPU for computation")

# Initialize the VQA model and processor
model_name = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(model_name)
model = ViltForQuestionAnswering.from_pretrained(model_name).to(device)

# Initialize the GPT-Neo model and tokenizer
gpt_neo_tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
gpt_neo_model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B").to(device)

# Directory containing PNG files
frames_dir = 'frames'  # relative path, matching the directory the frames were copied into

# Load frames
def load_frames():
    frames = {}
    for frame_file in os.listdir(frames_dir):
        if frame_file.endswith('.png'):
            img_path = os.path.join(frames_dir, frame_file)
            img = Image.open(img_path)
            frames[frame_file] = img
    return frames

frames = load_frames()

# Function to query frames
def query_frames(query):
    results = []
    for frame_file, img in frames.items():
        inputs = processor(text=query, images=img, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs.to(device))

        answer_idx = outputs.logits.argmax(-1).item()
        answer = model.config.id2label[answer_idx] if answer_idx in model.config.id2label else f"[unused{answer_idx}]"

        results.append(answer)
    return results

# Combine answers into a meaningful response
def combine_answers(query, answers):
    unique_answers = set(answers)
    if query.lower().startswith("how many"):
        num_counts = {}
        for answer in unique_answers:
            num_counts[answer] = answers.count(answer)
        num_answer = " and ".join([f"{count} {ans}" for ans, count in num_counts.items()])
        combined_response = f"There are {num_answer} in the video."
    else:
        combined_response = " ".join(unique_answers)

    return combined_response

# Use GPT-Neo to refine the response
def refine_response_with_gpt_neo(question, combined_response):
    prompt = f"Question: {question}\nCombined Response: {combined_response}\nProvide a single, coherent English answer based on the combined response:"

    inputs = gpt_neo_tokenizer.encode(prompt, return_tensors="pt").to(device)
    outputs = gpt_neo_model.generate(
        inputs,
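        max_new_tokens=100,  # counts generated tokens only; max_length would also count the prompt
        min_length=50,
        num_return_sequences=1,
        do_sample=True,  # temperature has no effect unless sampling is enabled
        temperature=0.7,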
        pad_token_id=gpt_neo_tokenizer.eos_token_id,
        eos_token_id=gpt_neo_tokenizer.eos_token_id,
        no_repeat_ngram_size=2  # Ensure the model does not repeat phrases
    )

    generated_text = gpt_neo_tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the answer from the generated text
    refined_answer = generated_text.split('Provide a single, coherent English answer based on the combined response:')[-1].strip()

    return refined_answer

# Example query
query = "What was the sergeant wearing who was giving the speech?"

t1 = time.time()
answers = query_frames(query)
t2 = time.time()
t3 = t2 - t1
print("query_frames time: ", t3)


t1 = time.time()
combined_response = combine_answers(query, answers)
t2 = time.time()
t3 = t2 - t1
print("combine_answers time: ", t3)

t1 = time.time()
final_response = refine_response_with_gpt_neo(query, combined_response)
t2 = time.time()
t3 = t2 - t1
print("refine_response_with_gpt_neo time: ", t3)

# Display results
print("Final Response:", final_response.split('.')[0].strip())
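
The combine step above removes duplicates with `set`, which has no defined order, so the combined response can differ between runs. The sketch below shows two deterministic alternatives using only the standard library: ranking answers by how many frames produced them, and removing duplicates while keeping first-seen order. The helper names and the sample answers are hypothetical.

```python
from collections import Counter


def summarize_answers(answers, top_n=3):
    """Return the `top_n` most frequent answers with their frame counts."""
    return Counter(answers).most_common(top_n)


def unique_in_order(answers):
    """Drop duplicate answers while keeping first-seen order."""
    return list(dict.fromkeys(answers))


per_frame = ["uniform", "hat", "uniform", "suit", "uniform", "hat"]
print(summarize_answers(per_frame, top_n=2))
print(" ".join(unique_in_order(per_frame)))
```

Ranking by frequency also gives a simple way to discard answers that appear in only one or two frames before the text is passed to the language model.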
