After researching the available options, I found that **Video-LLaMA2** is the most suitable model for my project, for the following reasons:

---

### **1. Comprehensive Multimodal Embedding Capabilities**
Video-LLaMA2 processes video, audio, and text together, which matches my project’s requirements. The model fuses these inputs into unified embeddings, and the resulting representations can serve as input features for downstream tasks such as price prediction and recommendation systems.

---

### **2. Pre-Trained on Large-Scale Datasets**
The model has been pre-trained on large datasets such as WebVid-2M (video-caption pairs) and the LLaVA image-caption data. This pre-training helps it handle complex video and audio features, including diverse and unstructured inputs.

---

### **3. Instruction Fine-Tuning**
Video-LLaMA2 has been instruction-tuned on data from MiniGPT-4, LLaVA, and VideoChat, which improves its ability to follow natural-language instructions and adapt to specific tasks. This is particularly valuable for tailoring the model to the project’s goals, such as extracting specific product information or interpreting price-related content in videos.

---

### **4. Flexibility and Customization**
The availability of both pre-trained and fine-tuned checkpoints for multiple versions (7B and 13B) allows me to choose the most appropriate configuration based on the project’s complexity and available resources. Additionally, it offers the flexibility to fine-tune the model further, ensuring it can align with niche requirements if needed.
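
To make the 7B versus 13B trade-off concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold the language-model weights. It assumes a fixed number of bytes per parameter (2 for fp16) and ignores activations, the visual and audio encoders, and any framework overhead, so actual GPU usage will be higher. The function name `weight_memory_gib` is my own and is not part of the Video-LLaMA code.

```python
# Rough weight-only memory estimate. Activations, the vision/audio
# encoders and framework overhead are NOT included (assumption), so
# real GPU usage will be higher than these numbers.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(n_params_billion, dtype="fp16"):
    """Approximate GiB required to store the language-model weights."""
    n_bytes = n_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return n_bytes / 2**30


for size in (7, 13):
    print(f"{size}B in fp16: ~{weight_memory_gib(size):.1f} GiB of weights")
```

Under these assumptions the 13B checkpoint needs roughly twice the weight memory of the 7B one, which is worth checking against the GPU type requested on the cluster.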

---

### **5. Ability to Process Videos with or Without Audio**
The model’s two branches, Vision-Language (VL) and Audio-Language (AL), allow it to handle a variety of scenarios, such as:
- Videos with rich audio content.
- Silent videos where visual context alone is critical.

This versatility makes the model applicable to diverse datasets: a video remains usable whether or not it carries an audio track.
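
One way to decide which branches a given file can use is to check whether it has an audio stream at all. The sketch below assumes `ffprobe` (from FFmpeg) is installed and on the `PATH`. The helper names and the `"VL"`/`"AL"` labels are hypothetical and only illustrate the routing idea; they are not part of the Video-LLaMA API.

```python
import json
import subprocess
from typing import List


def ffprobe_audio_cmd(video_path: str) -> List[str]:
    """Build an ffprobe command that lists only the audio streams as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=codec_type",
        "-of", "json",
        video_path,
    ]


def has_audio_stream(ffprobe_json: str) -> bool:
    """Return True if the ffprobe JSON output contains an audio stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return any(s.get("codec_type") == "audio" for s in streams)


def branches_for(video_path: str) -> List[str]:
    """Hypothetical routing: always use VL, add AL only when audio exists."""
    result = subprocess.run(
        ffprobe_audio_cmd(video_path),
        capture_output=True, text=True, check=True,
    )
    return ["VL", "AL"] if has_audio_stream(result.stdout) else ["VL"]
```

Calling `branches_for("./videos/example.mp4")` (a placeholder path) would then indicate whether audio embeddings can be extracted for that file or only visual ones.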

---

### **6. Cost-Effectiveness with Palmetto Access**
By running Video-LLaMA2 on the Palmetto Cluster, I avoid the high costs associated with cloud-based platforms like Google AI Studio. Leveraging Palmetto’s GPU and memory resources allows me to run the model efficiently without budget concerns.

---

### **7. Open-Source and Research-Focused**
As an open-source model, Video-LLaMA2 is freely available for research purposes, aligning perfectly with the academic goals of the project. Its strong documentation and active community support also make it easier to set up and troubleshoot.

---

### **Conclusion**
In summary, I found Video-LLaMA2 to be the most practical and powerful tool for my project. Its comprehensive multimodal capabilities, cost-effectiveness, and alignment with academic use cases make it the ideal choice for extracting rich embeddings from video and audio content, and ultimately, for achieving the project’s goals. This approach also ensures scalability and flexibility for future refinements.

Link: https://github.com/DAMO-NLP-SG/Video-LLaMA

In [1]:
import os

# Directories for videos, audio, and embeddings
videos_dir = './videos/'
audio_dir = './audio/'
embeddings_dir = './embeddings/'

# Create the directories if they do not exist
os.makedirs(videos_dir, exist_ok=True)
os.makedirs(audio_dir, exist_ok=True)
os.makedirs(embeddings_dir, exist_ok=True)

print("Directories set up.")


Directories set up.


In [3]:
print(video_urls_df['video_urls'])


0    https://m.media-amazon.com/images/S/vse-vms-tr...
1    https://m.media-amazon.com/images/S/vse-vms-tr...
2    https://m.media-amazon.com/images/S/vse-vms-tr...
3    https://m.media-amazon.com/images/S/vse-vms-tr...
4    https://m.media-amazon.com/images/S/vse-vms-tr...
5    https://m.media-amazon.com/images/S/vse-vms-tr...
6    https://m.media-amazon.com/images/S/vse-vms-tr...
7    https://m.media-amazon.com/images/S/vse-vms-tr...
8    https://m.media-amazon.com/images/S/vse-vms-tr...
9    https://m.media-amazon.com/images/S/vse-vms-tr...
Name: video_urls, dtype: object


In [5]:
# Keep only the first URL when several are separated by ';', and trim surrounding whitespace
video_urls_df['video_urls'] = video_urls_df['video_urls'].str.split(';').str[0].str.strip()
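
To illustrate what the cleaning step above does to a single value, here is a plain-Python sketch of the same logic, followed by one possible way to map a cleaned URL to a file under `videos_dir`. The example URLs are made up, and `first_url` and `local_video_path` are hypothetical helpers, not functions used elsewhere in this notebook.

```python
import os
from urllib.parse import urlparse


def first_url(raw):
    """Single-string equivalent of .str.split(';').str[0].str.strip()."""
    return raw.split(";")[0].strip()


def local_video_path(url, videos_dir="./videos/"):
    """Hypothetical helper: name the local file after the URL's last path part."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(videos_dir, filename)


# Made-up value with two ';'-separated URLs and stray whitespace
raw = " https://example.com/clips/a.mp4 ; https://example.com/clips/b.mp4"
url = first_url(raw)
print(url)
print(local_video_path(url))
```

Using the last path component as the filename assumes it is unique across videos; if it is not, a hash of the full URL would be a safer key.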
