
### 1. MSR-VTT (Microsoft Research Video to Text)
[HuggingFace](https://huggingface.co/datasets/friedrichor/MSR-VTT)

A large dataset with diverse videos from 20 categories. It can serve as the base dataset. Captions are short, and the train split contains many caption variants (paraphrases) for each video. Clip duration is 10–30 s.

```json
{
  "video_id": "video0",
  "video": "video0.mp4",
  "category_map": {
    "0": "music",
    "1": "people",
    "2": "gaming",
    "3": "sports",
    "4": "news",
    "5": "education",
    "6": "TV shows",
    "7": "movie",
    "8": "animation",
    "9": "vehicles",
    "10": "travel",
    "11": "science",
    "12": "animals",
    "13": "kids",
    "14": "food",
    "15": "cooking",
    "16": "beauty",
    "17": "fashion",
    "18": "documentary",
    "19": "ads"
  },
  "url": "https://www.youtube.com/watch?v=...",
  "start_time": 10.0,
  "end_time": 30.0,
  "caption": "a girl is playing a guitar",
  "sen_id": 1500
}
```

Train:
* train_7k: 7,010 videos, 140,200 captions
* train_9k: 9,000 videos, 180,000 captions

Test:
* test_1k: 1,000 videos, 1,000 captions
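
The train counts above work out to 20 captions per video (140,200 / 7,010 and 180,000 / 9,000), while test_1k has one caption per video. Below is a minimal sketch of grouping flat caption records by video. It assumes each record carries the `video_id` and `caption` fields shown in the example. The helper name `group_captions` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_captions(records):
    """Map video_id -> list of captions, preserving record order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["video_id"]].append(rec["caption"])
    return dict(grouped)


# Hypothetical sample records in the shape of the example above.
records = [
    {"video_id": "video0", "caption": "a girl is playing a guitar"},
    {"video_id": "video0", "caption": "a woman plays guitar"},
    {"video_id": "video1", "caption": "a man is cooking"},
]

captions_by_video = group_captions(records)
print({vid: len(caps) for vid, caps in captions_by_video.items()})
```

Grouping this way makes it easy to check the captions-per-video count of a split, or to sample one paraphrase per video during training.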

***

### 2. VATEX (In-the-Wild Multilingual)
Clean version: [HuggingFace](https://huggingface.co/datasets/VLM2Vec/VATEX). Original version: [HuggingFace](https://huggingface.co/datasets/HuggingFaceM4/vatex)

All videos show people (human actions). Each item contains 10 captions (the original version includes both English and Chinese captions).

```json
{
  "videoID": "V_001",
  "enCap": [
    "A person is doing pushups on a floor with good form...",
    "Man exercising in a gym doing pushups...",
    ...
  ]
}
```

* test: 4,480 videos

***

### 3. YouCook2 (Instructional / Procedural)
[HuggingFace](https://huggingface.co/datasets/lmms-lab/YouCook2)

Cooking videos; a small subset is enough. Some videos are too long, so we take only the short ones. Each clip has only one caption.

```json
{
  "id": "xHr8X2Wpmno_0",
  "video_url": "https://www.youtube.com/watch?v=...",
  "recipe_type": 266,
  "segment": [47, 60],
  "sentence": "pick the ends off the verdalago",
  "video_path": "val/xHr8X2Wpmno_0.mp4",
  "youtube_id": "xHr8X2Wpmno",
}
```

* val: 3,180 videos
* test: 1,470 videos
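
Selecting only the short clips can be sketched as a filter on segment length. This assumes `segment` holds `[start, end]` in seconds, as the example record suggests. The 30-second cutoff is an assumption chosen to match the MSR-VTT clip length, and the helper names and the second sample record are hypothetical.

```python
def segment_duration(record):
    """Length of the annotated segment, assuming [start, end] in seconds."""
    start, end = record["segment"]
    return end - start


def keep_short(records, max_seconds=30):
    """Keep only records whose segment is at most max_seconds long."""
    return [r for r in records if segment_duration(r) <= max_seconds]


# First record follows the example above; the second one is made up.
sample = [
    {"id": "xHr8X2Wpmno_0", "segment": [47, 60]},
    {"id": "hypothetical_long_clip", "segment": [10, 95]},
]

short = keep_short(sample)
print([r["id"] for r in short])
```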

***

### Rejected datasets

1.  **ActivityNet Captions** [HuggingFace](https://huggingface.co/datasets/HuggingFaceM4/ActivitiyNet_Captions)
    *   Videos are too long.
2.  **HowTo100M** [HuggingFace](https://huggingface.co/datasets/HuggingFaceM4/howto100m)
    *   Suitable only for pre-training. The data is too noisy.
3.  **DiDeMo** [HuggingFace](https://huggingface.co/datasets/friedrichor/DiDeMo)
    *   Videos are too long.
4.  **Kinetics-400/600** [Kaggle](https://www.kaggle.com/datasets/rohanmallick/kinetics-train-5per)
    *   500k videos, but without captions.
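
The evaluation cell below reports ARI (Adjusted Rand Index) as one of its supervised metrics. For intuition, here is a minimal pure-Python sketch of the standard ARI definition. It is not the implementation behind `src.clustering.evaluate_clustering`, which is not shown in this notebook. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical partitions

    # Count item pairs that share a group in both labelings,
    # in the ground truth, and in the prediction.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs  # agreement expected by chance
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single group
    return (same_both - expected) / (max_index - expected)


categories = [0, 0, 1, 1]
ari_renamed = adjusted_rand_index(categories, [1, 1, 0, 0])  # same grouping, renamed clusters
ari_split = adjusted_rand_index(categories, [0, 1, 0, 1])    # every category split in half
print(ari_renamed, ari_split)
```

A value of 1.0 means the clusters reproduce the category partition exactly (cluster ids do not matter), and values near 0 mean agreement no better than chance.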


In [None]:
from pathlib import Path
import numpy as np

from src.data_loader import load_dataset_embeddings
from src.clustering import evaluate_clustering
from src.retrieval import evaluate_retrieval
from src.visualization import compute_rank

DIRS = {
    "xclip": "clip",
    "siglip": "clip",
    "qwen3": "vlm",
    "videomae": "video-mae"
}

DATASETS = ["msrvtt", "vatex", "youcook2"]


EMBED_DIR = "./embedings/"
VLM_DIR = EMBED_DIR + "vlm/"
CLIP_DIR = EMBED_DIR + "clip/"
MAE_DIR = EMBED_DIR + "video-mae/"
embed_dir = CLIP_DIR
dataset = "msrvtt"
model = "xclip"

def whatup(model, dataset, n_clusters=10):
    """Print clustering metrics for the precomputed embeddings of one (model, dataset) pair."""
    embed_dir = EMBED_DIR + DIRS[model] + '/'

    video_embs, text_embs, categories, df = load_dataset_embeddings(embed_dir, dataset, model)

    # print(f"Video embeddings: {video_embs.shape if video_embs is not None else None}")
    # print(f"Text embeddings: {text_embs.shape if text_embs is not None else None}")

    # print(f"\n{'='*60}")
    print(f"Dataset: {dataset.upper()} | Model: {model.upper()}")
    # print('='*60)

    # print("\n--- Embedding Space Analysis ---")
    # if video_embs is not None:
    #     rank = compute_rank(video_embs)
    #     print(f"Matrix rank (video): {rank}/{video_embs.shape[1]}")
    # if text_embs is not None:
    #     rank_text = compute_rank(text_embs)
    #     print(f"Matrix rank (text): {rank_text}/{text_embs.shape[1]}")

    print("\n--- Clustering Metrics ---")
    if categories is not None:
        # Ground-truth labels are available: report supervised and unsupervised metrics.
        metrics, n_clusters_used = evaluate_clustering(video_embs, categories, n_clusters=n_clusters)
        print(f"Number of clusters: {n_clusters_used}")
        print("Supervised metrics:")
        for metric in ["ARI", "NMI"]:
            if metric in metrics:
                print(f"  {metric}: {metrics[metric]:.4f}")
        print("Unsupervised metrics:")
        for metric in ["Silhouette", "Calinski-Harabasz", "Davies-Bouldin"]:
            if metric in metrics:
                print(f"  {metric}: {metrics[metric]:.4f}")
    else:
        # No labels: n_clusters was previously undefined here and raised a NameError.
        if n_clusters is None:
            print("Warning: No ground truth labels. Pass n_clusters for unsupervised evaluation.")
        else:
            metrics, n_clusters_used = evaluate_clustering(video_embs, categories=None,
                                                          n_clusters=n_clusters)
            print(f"Number of clusters: {n_clusters_used}")
            print("Unsupervised metrics:")
            for metric, value in metrics.items():
                print(f"  {metric}: {value:.4f}")
    print()

for model in ["xclip", "siglip", "qwen3", "videomae"]:
    for dataset in ["vatex", "msrvtt", "youcook2"]:
        whatup(model, dataset)

Dataset: VATEX | Model: XCLIP

--- Clustering Metrics ---
Number of clusters: 10
Supervised metrics:
Unsupervised metrics:
  Silhouette: 0.0062
  Calinski-Harabasz: 12.2073
  Davies-Bouldin: 5.3613

Dataset: MSRVTT | Model: XCLIP

--- Clustering Metrics ---
Number of clusters: 20
Supervised metrics:
  ARI: 0.2034
  NMI: 0.3818
Unsupervised metrics:
  Silhouette: 0.0302
  Calinski-Harabasz: 12.0438
  Davies-Bouldin: 4.0354

Dataset: YOUCOOK2 | Model: XCLIP

--- Clustering Metrics ---
Number of clusters: 10
Supervised metrics:
Unsupervised metrics:
  Silhouette: 0.0649
  Calinski-Harabasz: 35.4584
  Davies-Bouldin: 2.8071

Dataset: VATEX | Model: SIGLIP

--- Clustering Metrics ---
Number of clusters: 10
Supervised metrics:
Unsupervised metrics:
  Silhouette: 0.0387
  Calinski-Harabasz: 22.1261
  Davies-Bouldin: 3.6919

Dataset: MSRVTT | Model: SIGLIP

--- Clustering Metrics ---
Number of clusters: 20
Supervised metrics:
  ARI: 0.2234
  NMI: 0.4079
Unsupervised metrics:
  Silhouette: 0.04