# Chat with the Video 👩🏻‍💻💬
* Video-LLaMA is a large multimodal model designed for video understanding, integrating both vision and audio processing. 
* We used the [Video-LLaMA](https://github.com/DAMO-NLP-SG/Video-LLaMA) repository and adapted it to run on Intel GPUs; the steps are documented [here](https://github.com/rskasturi/usecases/tree/master/video_analytics).
* Powered by Intel® Data Center GPU Max 1100s, this notebook offers an accessible hands-on experience that doesn’t require deep technical knowledge.

## Overview 📖
In this notebook, you will learn how to run the multimodal Video-LLaMA model on Intel Max Series GPUs and explore how it processes visual and textual data together. The notebook covers these steps:

1. Setting up the environment and optimizing it for Intel GPUs
2. Downloading the pretrained model from [Hugging Face](https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned)
3. Modifying the YAML file by setting the model path, ImageBind path, and the paths for the video and audio checkpoints
4. Setting model arguments
5. Loading the model on Intel Max Series GPU
6. Loading the video and running the model

### Step 1: Setting Up the Environment ⚙️
Let's start by preparing our environment! We will import all the essential packages, including the Hugging Face transformers library, the Video-LLaMA dependencies, and Intel Extension for PyTorch.

In [1]:
import argparse
import os
import random
import subprocess
import yaml
import numpy as np
import torch
import time
from video_llama.common.config import Config
from video_llama.common.dist_utils import get_rank
from video_llama.common.registry import registry
from video_llama.conversation.conversation_video import Chat, Conversation, default_conversation, SeparatorStyle, conv_llava_llama_2
import decord
from IPython.display import display, Video, Image
decord.bridge.set_bridge('torch')

from video_llama.datasets.builders import *
from video_llama.models import *
from video_llama.processors import *
from video_llama.runners import *
from video_llama.tasks import *

import intel_extension_for_pytorch as ipex 
from huggingface_hub import hf_hub_download, list_repo_files

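Before loading the model, it can help to confirm that PyTorch actually sees the Intel GPU. The following is a minimal sketch and not part of the original notebook: `pick_device` is a hypothetical helper that returns `'xpu'` when `torch.xpu.is_available()` reports a device and falls back to `'cpu'` otherwise, including when PyTorch or Intel Extension for PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return 'xpu' if an Intel GPU is usable from PyTorch, otherwise 'cpu'.

    Hypothetical helper for illustration; it is not part of Video-LLaMA.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    try:
        import torch

        # Importing IPEX registers the XPU backend on older PyTorch builds
        if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
            import intel_extension_for_pytorch  # noqa: F401

        xpu = getattr(torch, "xpu", None)
        if xpu is not None and xpu.is_available():
            return "xpu"
    except Exception:
        # Driver or version mismatches can surface as import or runtime errors
        pass
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` on a machine that should have an Intel GPU, the driver or the Intel Extension for PyTorch installation is the first thing to check.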


### Step 2: Downloading the pretrained model 🏃🏻
With the `huggingface_hub` library, we can easily download the model files from Hugging Face.

In [2]:
# Download the model from Hugging Face unless it is already cached locally
if os.path.exists("./eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/"):
    print("Model exists")
else:
    print("Downloading the model")
    model_repo = "DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned"
    
    files = list_repo_files(repo_id=model_repo)

    # Download every file in the repository into the local cache directory
    for file in files:
        local_path = hf_hub_download(repo_id=model_repo, filename=file, cache_dir="./eval_configs/")
        print(f"Downloaded: {file} at {local_path}")

# Base directory that holds the YAML config and the downloaded model files
current_dir = os.getcwd()
model_path = current_dir + "/eval_configs/"

Model exists
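As an alternative to looping over `list_repo_files`, `huggingface_hub` also provides `snapshot_download`, which fetches a whole repository in one call and returns the local snapshot folder. The sketch below is one possible way to write this cell, not the notebook's own code; `repo_cache_folder` and `download_model` are hypothetical helper names, and the `models--<org>--<name>` folder naming follows the cache path used in the cell above.

```python
import os


def repo_cache_folder(repo_id: str) -> str:
    """Map a repo id such as 'org/name' to its cache folder name."""
    return "models--" + repo_id.replace("/", "--")


def download_model(repo_id: str, cache_dir: str = "./eval_configs/") -> str:
    """Download every file of a model repo and return the snapshot folder."""
    # Imported lazily so the helper above works without huggingface_hub
    from huggingface_hub import snapshot_download

    # Files that are already cached are not downloaded again
    return snapshot_download(repo_id=repo_id, cache_dir=cache_dir)


model_repo = "DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned"
expected = os.path.join("./eval_configs/", repo_cache_folder(model_repo))
print("Model exists" if os.path.isdir(expected) else "Model not downloaded yet")
# To download: snapshot_dir = download_model(model_repo)
```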


### Step 3: Modify the YAML file✍🏻

To run the script, we need to modify the YAML file by adding the model paths and checkpoint paths.🎯

In [3]:
# Define the path to your YAML file
file_path = model_path+'video_llama_eval_withaudio.yaml'

# Load the YAML file safely
with open(file_path, 'r') as file:
    content = file.read()
    # print(content)  # Debugging step: to inspect YAML content
    data = yaml.safe_load(content)

# Function to recursively search for a key and replace its value
def find_and_replace(data, target, replacement):
    if isinstance(data, dict):
        for key, value in data.items():
            # Check if the key matches
            if key == target:
                print(f"Replacing {key}: {value} with {replacement}")  # Debug statement
                data[key] = str(replacement)
            else:
                find_and_replace(value, target, replacement)
    elif isinstance(data, list):
        for index, item in enumerate(data):
            find_and_replace(item, target, replacement)

# Keys to update in the YAML file, and their new values relative to model_path
snapshot = "models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/"
target_variable = ['llama_model', 'imagebind_ckpt_path', 'ckpt', 'ckpt_2']
new_value = [snapshot + "llama-2-7b-chat-hf/",
             snapshot,
             snapshot + "VL_LLaMA_2_7B_Finetuned.pth",
             snapshot + "AL_LLaMA_2_7B_Finetuned.pth"]

# Replace each key's value with the absolute path (model_path + relative value)
for key, val in zip(target_variable, new_value):
    find_and_replace(data, key, model_path + val)

# Save the modified data back to the YAML file once all keys are updated
with open(file_path, 'w') as file:
    yaml.dump(data, file, default_flow_style=False)

# print(f"The variable '{target_variable}' has been replaced with '{new_value}'.")


Replacing llama_model: /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/llama-2-7b-chat-hf/ with /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/llama-2-7b-chat-hf/
Replacing imagebind_ckpt_path: /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/ with /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/
Replacing ckpt: /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/VL_LLaMA_2_7B_Finetuned.pth 
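
The paths in this cell hard-code the snapshot hash, which ties the notebook to one revision of the model repository. A minimal sketch of looking the snapshot folder up instead, assuming the `models--<org>--<name>/snapshots/<hash>/` layout seen in the output above; `find_snapshot_dir` is a hypothetical helper, not part of the notebook.

```python
import tempfile
from pathlib import Path


def find_snapshot_dir(cache_dir: str, repo_id: str) -> Path:
    """Return the most recently modified snapshot folder for repo_id."""
    repo_folder = "models--" + repo_id.replace("/", "--")
    snapshots = Path(cache_dir) / repo_folder / "snapshots"
    candidates = [p for p in snapshots.iterdir() if p.is_dir()] if snapshots.is_dir() else []
    if not candidates:
        raise FileNotFoundError(f"No snapshot found under {snapshots}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration on a throwaway directory that mimics the cache layout
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "models--org--demo" / "snapshots" / "abc123"
    fake.mkdir(parents=True)
    found = find_snapshot_dir(tmp, "org/demo")
    print(found.name)  # abc123
```

In the notebook, the returned folder could replace the hard-coded snapshot prefix when building the four path values.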

### Step 4: Setting model arguments📝
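`argparse.ArgumentParser()` creates an argument parser that manages command-line arguments. Its `description` parameter sets the short program description shown when the script is run with `--help`. Inside a notebook no arguments are passed by the user, so the `parse_args()` function below falls back to its defaults, and `parse_known_args()` ignores the extra arguments that Jupyter passes to the kernel.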

In [4]:
def parse_args():
    # Create an ArgumentParser object which will handle command-line arguments
    parser = argparse.ArgumentParser(description="Demo")
    
    # Add an argument for the configuration file path
    # The default value is set to './eval_configs/video_llama_eval_withaudio.yaml'
    # The help parameter provides a description for the argument in the command-line help message
    parser.add_argument("--cfg-path", 
                        default='./eval_configs/video_llama_eval_withaudio.yaml', 
                        help="path to configuration file.")
    
    # Add an argument to specify the GPU ID for model loading
    # It takes an integer value, with the default set to 0 (typically the first GPU)
    parser.add_argument("--gpu-id", 
                        type=int, 
                        default=0, 
                        help="specify the gpu to load the model.")
    
    # Add an argument to specify the type of LLM (Large Language Model) to use
    # The default is set to 'llama_v2', but it can be changed based on user preference
    parser.add_argument("--model_type", 
                        type=str, 
                        default='llama_v2', 
                        help="The type of LLM")
    
    # Add an argument that allows the user to override settings in the configuration file
    # 'nargs="+"' means the argument expects one or more values, provided as key-value pairs in the form of xxx=yyy
    # This feature is deprecated, and users should use --cfg-options instead, as noted in the help description
    parser.add_argument(
        "--options",
        nargs="+",  # Allow multiple values to be passed
        help="override some settings in the used config, the key-value pair "
             "in xxx=yyy format will be merged into config file (deprecated), "
             "change to --cfg-options instead.",
    )
    
    # Parse the arguments from the command line input
    # parse_known_args() allows parsing of arguments while ignoring any unknown ones (those not defined in the parser)
    args, unknown = parser.parse_known_args()
    
    # Return the parsed arguments as a Namespace object
    return args
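
The choice of `parse_known_args()` matters in a notebook: it returns the recognised options together with a list of everything it did not recognise, instead of exiting with an error. A small, self-contained sketch of that behaviour follows; the argument list is made up for illustration and `build_parser` is a hypothetical helper that mirrors two of the arguments defined above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced version of the parser used in parse_args()."""
    parser = argparse.ArgumentParser(description="Demo")
    parser.add_argument("--cfg-path",
                        default="./eval_configs/video_llama_eval_withaudio.yaml")
    parser.add_argument("--gpu-id", type=int, default=0)
    return parser


# Illustrative argv: one known option plus kernel-style extras
argv = ["--gpu-id", "1", "-f", "kernel-1234.json"]
args, unknown = build_parser().parse_known_args(argv)

print(args.cfg_path)  # the default path is kept
print(args.gpu_id)    # 1
print(unknown)        # ['-f', 'kernel-1234.json']
```

With `parse_args()` instead, the same input would stop with an "unrecognized arguments" error.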


### Step 5: Loading the model on Intel Max Series GPU🚀
`.to(device)` is the PyTorch operation that transfers a model from one device (such as the CPU) to another (such as a GPU or other accelerator). In this notebook the target device is `'xpu'`, the device type PyTorch uses for Intel GPUs.

In [5]:
# Print a message indicating that the Chat initialization process has started
print('Initializing Chat')

# Parse the arguments defined in parse_args() above
# In a notebook no arguments are passed, so the defaults are used
args = parse_args()

# Initialize the configuration object using the parsed arguments
# This configuration will hold various settings related to the model, dataset, etc.
cfg = Config(args)

# Retrieve the model configuration from the loaded configuration object
# The 'model_cfg' section holds the configuration specific to the model's structure and behavior
model_config = cfg.model_cfg

# Fetch the model class for the architecture named in the configuration (model_config.arch)
# 'registry' maps registered architecture names to model classes
model_cls = registry.get_model_class(model_config.arch)


##### Device selection and model setup
# Instantiate the model from the configuration with 'model_cls.from_config()',
# then move it with '.to('xpu')' to the Intel GPU ('xpu' is the PyTorch device type for Intel GPUs)
model = model_cls.from_config(model_config).to('xpu')

# Set the model to evaluation mode
# In this mode, the model will not perform operations like dropout, which are only relevant during training
model.eval()

# Set the data type (precision) for mixed-precision computation
# 'amp_dtype = torch.bfloat16' sets the model to use bfloat16 precision, which is often used for faster computation
# and lower memory consumption on specialized hardware (like 'xpu') without sacrificing too much accuracy.
amp_dtype = torch.bfloat16 

# Optimize the model with the LLM-specific optimizations in Intel Extension for PyTorch (IPEX)
# The dtype (precision) and device ('xpu') are specified explicitly
# 'inplace=True' means the model is modified directly instead of creating a new object
model = ipex.llm.optimize(model.eval(), dtype=amp_dtype, device="xpu", inplace=True)

# Extract the visual processor configuration from the datasets section of the config
# It holds the settings used to preprocess video frames
vis_processor_cfg = cfg.datasets_cfg.webvid.vis_processor.train

# Look up the processor class by its configured name and build it from these settings
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)

# Initialize the Chat object with the configured model, visual processor, and device
# This will set up the chat system to process both text and visual data using the 'xpu' device.
chat = Chat(model, vis_processor, device='xpu')

# Print a message indicating that the initialization of Chat has been completed
print('Initialization Finished')

# Copy the Llama-2 conversation template so this session has its own chat state
# 'chat_state' carries the conversation context across questions
chat_state = conv_llava_llama_2.copy()


Initializing Chat
Loading VIT


2024-12-05 14:50:33,709 - root - INFO - freeze vision encoder
BertLMHeadModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


Loading VIT Done
Loading Q-Former


2024-12-05 14:50:37,948 - root - INFO - load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
2024-12-05 14:50:37,962 - root - INFO - freeze Qformer
2024-12-05 14:50:37,962 - root - INFO - Loading Q-Former Done
2024-12-05 14:50:37,963 - root - INFO - Loading LLAMA Tokenizer
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
2024-12-05 14:50:38,184 - root - INFO - Loading LLAMA Model
LlamaForCausalLM has generative capabi

Initializing audio encoder from /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/ ...
audio encoder initialized.


2024-12-05 14:51:01,365 - root - INFO - audio_Qformer and audio-LLAMA proj is frozen


Load first Checkpoint: /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/VL_LLaMA_2_7B_Finetuned.pth
Load second Checkpoint: /home/u90dd6ca8f8647078b09809a2d4a416b/gi/video-analytics/eval_configs/models--DAMO-NLP-SG--Video-LLaMA-2-7B-Finetuned/snapshots/9d9519ffac4e48ef6510e829b1a1a643771a4dd0/AL_LLaMA_2_7B_Finetuned.pth
Initialization Finished
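
To make the memory benefit of `torch.bfloat16` concrete, here is a back-of-the-envelope calculation. It assumes a nominal 7 billion parameters for the language model (suggested by the `7B` in the checkpoint name) and counts weights only, ignoring the vision and audio encoders, activations, and the key-value cache, so the numbers are rough estimates rather than measurements.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: the "7B" in the model name taken at face value
NOMINAL_PARAMS = 7_000_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
# float32: 26.1 GiB
# bfloat16: 13.0 GiB
```

Halving the bytes per parameter halves the weight footprint, which is the main reason reduced precision is attractive for inference on a single GPU.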




### Step 6: Loading the video and running the model🚀
The following code displays a video and loads it into the model. Generation settings such as `num_beams`, `temperature`, and `max_new_tokens` can be adjusted in the `chat.answer` call.

In [6]:
# Clear the system prompt and start with an empty image list
chat_state.system = ""
img_list = []

# Loading the video from a specific path.
video_path = "./examples/IronMan.mp4"

# Displaying the loaded video
display(Video(video_path, embed=True, width = 320, height = 240))

In [None]:
# Upload the video so the model can encode it into the chat state and image list
llm_message = chat.upload_video(video_path, chat_state, img_list)

while True:
    # Read the next question; type "exit" or "quit" to leave the loop
    user_message = input("User: ")
    if user_message.strip().lower() in {"exit", "quit"}:
        break

    # Add the user query to the conversation
    chat.ask(user_message, chat_state)

    start_time = time.time()

    # Generate the answer; adjust the generation arguments here
    llm_message = chat.answer(conv=chat_state,
                              img_list=img_list,
                              num_beams=1,
                              temperature=0.15,
                              max_new_tokens=300,
                              repetition_penalty=1.0,
                              # top_p=0.75,
                              max_length=3000)[0]
    end_time = time.time()

    print(llm_message)
    print("Time taken: ", end_time - start_time, "\n")



./examples/IronMan.mp4
no audio is found


User:  video is about ?


The video is about a man wearing a red suit and standing in a room with a lot of machinery.
Time taken:  1.5802936553955078 



User:  summary of the video


The video shows a man wearing a red suit standing in a room with a lot of machinery. The man is wearing a red suit and has a red helmet on his head. He is standing in front of a lot of machinery and there are several other people in the room.
Time taken:  2.3918118476867676 



User:  spider man is there in this video ?


No, there is no spider man in the video.
Time taken:  0.6562099456787109 



User:  which marvel character is there in this video ?


There is no marvel character in the video.
Time taken:  0.7134983539581299 



User:  are you sure?


I apologize, I made a mistake. There is a marvel character in the video, it is Iron Man.
Time taken:  1.3483262062072754 
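
The timing logic in the loop can be factored into a small helper. This is a sketch only: `timed` is a hypothetical function, and `fake_answer` stands in for `chat.answer` so the example runs without the model.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_answer(question: str) -> str:
    """Stand-in for the model call; returns a canned reply."""
    return f"(reply to: {question})"


reply, seconds = timed(fake_answer, "video is about ?")
print(reply)
print(f"Time taken: {seconds:.3f}\n")
```

With the real model, `timed` would wrap `chat.answer` with the same keyword arguments as in the loop above, and the first element of the returned result would be the answer text.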

