##Cell 1: Install Dependencies & Import Modules
Install all required Python libraries (Ultralytics YOLOv8, PyTorch, FAISS, Transformers, OpenCV, etc.), import standard and third‐party modules, and configure logging. This ensures your environment has everything needed for video processing, object detection, CLIP embeddings, similarity search, and NLP tasks.



In [1]:
# Install required packages
!pip install --quiet ultralytics torch torchvision
!pip install --quiet ftfy regex tqdm pandas numpy
!pip install --quiet sentence-transformers transformers faiss-cpu
!pip install --quiet spacy fastapi uvicorn python-multipart
!pip install --quiet Pillow opencv-python
!pip install --quiet gdown openpyxl

# Import standard modules
import os
import json
import logging
from pathlib import Path

import cv2
import pandas as pd
import numpy as np
from PIL import Image
import requests

import torch
from ultralytics import YOLO
import faiss
from transformers import CLIPProcessor, CLIPModel
from transformers import pipeline as hf_pipeline

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

# Sanity checks
print("Torch version:", torch.__version__)
print("FAISS version:", faiss.__version__)


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m25.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m97.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m79.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m45.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m13.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

##Cell 2: Unzip Videos & Create Folders
Unpack the uploaded ZIP archive of videos into a nested data/videos/videos/ directory and create the working folders:

1.catalog_images/ for downloaded product thumbnails

2.frames/ for extracted video frames

3.detections/ for YOLO output JSONs

4.crops/ for cropped product images

5.outputs/ for the final per‐video JSON results

6.models/ for downloaded model weights

7.Verify that the .mp4 files were extracted correctly.

In [2]:

# Cell 2: Unzip `videos-20250605T140122Z-1-001.zip` and create necessary folders─
!ls /content

# 2) Define root paths
DATA_ROOT  = Path("/content/data")
VIDEOS_ZIP = Path("/content/videos-20250605T140122Z-1-001.zip")  # <<< Change if your ZIP has a different name

assert VIDEOS_ZIP.exists(), f"ERROR: {VIDEOS_ZIP} not found. Make sure you uploaded it."

# 3) Unzip into data/videos/videos/ (one extra layer)
DEST_VIDEOS = DATA_ROOT / "videos" / "videos"
os.makedirs(DEST_VIDEOS, exist_ok=True)
!unzip -qo "{VIDEOS_ZIP}" -d "{DATA_ROOT / 'videos'}"

# 4) Create the other required directories
CATALOG_IMG_DIR = DATA_ROOT / "catalog_images"
OUTPUT_DIR      = DATA_ROOT / "outputs"
FRAMES_DIR      = DATA_ROOT / "frames"
DETECTIONS_DIR  = DATA_ROOT / "detections"
CROPS_DIR       = DATA_ROOT / "crops"
MODELS_DIR      = Path("/content/models")

for folder in [CATALOG_IMG_DIR, OUTPUT_DIR, FRAMES_DIR, DETECTIONS_DIR, CROPS_DIR, MODELS_DIR]:
    os.makedirs(folder, exist_ok=True)

# 5) Confirm that videos are present under data/videos/videos/
print("Videos extracted:")
print(sorted(os.listdir(DEST_VIDEOS)))


images.csv	   README.pdf	vibeslist.json
product_data.xlsx  sample_data	videos-20250605T140122Z-1-001.zip
Videos extracted:
['2025-05-22_08-25-12_UTC.jpg', '2025-05-22_08-25-12_UTC.json.xz', '2025-05-22_08-25-12_UTC.mp4', '2025-05-22_08-25-12_UTC.txt', '2025-05-27_13-46-16_UTC.jpg', '2025-05-27_13-46-16_UTC.json.xz', '2025-05-27_13-46-16_UTC.mp4', '2025-05-27_13-46-16_UTC.txt', '2025-05-28_13-40-09_UTC.jpg', '2025-05-28_13-40-09_UTC.json.xz', '2025-05-28_13-40-09_UTC.mp4', '2025-05-28_13-40-09_UTC.txt', '2025-05-28_13-42-32_UTC.jpg', '2025-05-28_13-42-32_UTC.json.xz', '2025-05-28_13-42-32_UTC.mp4', '2025-05-28_13-42-32_UTC.txt', '2025-05-31_14-01-37_UTC.jpg', '2025-05-31_14-01-37_UTC.json.xz', '2025-05-31_14-01-37_UTC.mp4', '2025-05-31_14-01-37_UTC.txt', '2025-06-02_11-31-19_UTC.jpg', '2025-06-02_11-31-19_UTC.json.xz', '2025-06-02_11-31-19_UTC.mp4', '2025-06-02_11-31-19_UTC.txt']


##Cell 3: Load & Merge Catalog Data
Load your product catalog from product_data.xlsx and the corresponding image URLs from images.csv. Perform an inner join on the id column to produce a single DataFrame with all product metadata and download links. Also load the list of “vibes” from vibeslist.json for later classification.

In [3]:
# Paths to catalog files
CATALOG_XLSX = DATA_ROOT / "/content/product_data.xlsx"
IMAGES_CSV   = DATA_ROOT / "/content/images.csv"
VIBES_JSON   = DATA_ROOT / "/content/vibeslist.json"

# Check existence
assert CATALOG_XLSX.exists(), "Catalog Excel file not found."
assert IMAGES_CSV.exists(),   "Images CSV file not found."
assert VIBES_JSON.exists(),   "Vibes JSON file not found."

# Load catalog and inspect
catalog_df = pd.read_excel(CATALOG_XLSX, engine="openpyxl")
print("Catalog columns:", catalog_df.columns.tolist())

# Load images.csv
images_df = pd.read_csv(IMAGES_CSV)
print("Images CSV columns:", images_df.columns.tolist())

# Merge on 'id'
assert "id" in catalog_df and "id" in images_df, "Column 'id' missing."
assert "image_url" in images_df, "Column 'image_url' missing."
merged_df = catalog_df.merge(images_df[["id", "image_url"]], on="id", how="inner")
print("Merged catalog rows:", len(merged_df))

# Load vibes list
with open(VIBES_JSON) as f:
    vibes_list = json.load(f)
print("Vibes loaded:", vibes_list)


Catalog columns: ['id', 'title', 'description', 'product_type', 'alias', 'mrp', 'price_display_amount', 'discount_percentage', 'product_tags', 'product_collections']
Images CSV columns: ['id', 'image_url']
Merged catalog rows: 9172
Vibes loaded: ['Coquette', 'Clean Girl', 'Cottagecore', 'Streetcore', 'Y2K', 'Boho', 'Party Glam']


##Cell 4: Download Catalog Images
Iterate over each row of the merged catalog DataFrame, download the product thumbnail image from its URL, and save it locally in catalog_images/{id}.jpg. Report any download failures and summarize how many images were saved successfully.

In [4]:
def download_catalog_images(df, id_col, url_col, target_dir):
    target_dir.mkdir(exist_ok=True)
    failures = []
    for _, row in df.iterrows():
        pid, url = str(row[id_col]), row[url_col]
        out_path = target_dir / f"{pid}.jpg"
        if out_path.exists():
            continue
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 200:
                out_path.write_bytes(resp.content)
            else:
                failures.append(pid)
        except:
            failures.append(pid)
    return failures

failures = download_catalog_images(merged_df, "id", "image_url", CATALOG_IMG_DIR)
print("Download failures:", failures[:5])
print("Total images:", len(os.listdir(CATALOG_IMG_DIR)))


Download failures: ['14983', '14983', '14983', '14983', '14983']
Total images: 967


##Cell 5: Frame Extraction
Define and test a function that opens each video (.mp4), samples one frame every fixed interval (e.g. 0.5 seconds), and writes the frames as JPEGs into frames/{video_id}/. This provides the still images on which object detection will run.



In [5]:
def extract_frames(video_path: Path, dest_dir: Path, interval_sec=0.5):
    dest_dir.mkdir(exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = int(fps * interval_sec)
    idx, saved = 0, 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % step == 0:
            out_file = dest_dir / f"frame_{saved:05d}.jpg"
            cv2.imwrite(str(out_file), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print("Sample extraction:", extract_frames(list(DEST_VIDEOS.glob("*.mp4"))[0], FRAMES_DIR / "test_video"))


Sample extraction: 23


##Cell 6: YOLOv8 Object Detection
Download and load the pretrained YOLOv8n model weights. Define a function that runs YOLOv8 on every extracted frame in a given folder, collects all bounding‐box detections (class, confidence, coordinates), and writes them to a JSON file detections/{video_id}.json.

In [6]:
# Download and load pretrained weights
YOLO_WEIGHTS = MODELS_DIR / "yolov8n.pt"
if not YOLO_WEIGHTS.exists():
    !curl -L -o "{YOLO_WEIGHTS}" \
      https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt

model = YOLO(str(YOLO_WEIGHTS))

def detect_objects(frames_dir: Path, output_json: Path):
    detections = []
    for frame_path in sorted(frames_dir.glob("*.jpg")):
        results = model(str(frame_path))[0]
        for box in results.boxes:
            x1, y1, x2, y2 = box.xyxy.tolist()[0]
            cls_id, conf = int(box.cls[0]), float(box.conf[0])
            detections.append({
                "frame": frame_path.name,
                "class": model.names[cls_id],
                "bbox": [x1, y1, x2-x1, y2-y1],
                "confidence": conf
            })
    output_json.write_text(json.dumps(detections, indent=2))
    return len(detections)


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 6381k  100 6381k    0     0  6604k      0 --:--:-- --:--:-- --:--:-- 11.0M


##Cell 7: Crop Detected Items
Read the YOLO detection JSON, group detections by frame, and for each bounding box crop out that region of the frame. Save each crop in crops/{video_id}/ and record minimal metadata (crop path and detected class) for further processing.

In [7]:
def crop_items(frames_dir: Path, detections_json: Path, crops_dest: Path):
    crops_dest.mkdir(exist_ok=True)
    dets = json.loads(detections_json.read_text())
    by_frame = {}
    for d in dets:
        by_frame.setdefault(d["frame"], []).append(d)

    metadata = []
    for frame_name, recs in by_frame.items():
        img = cv2.imread(str(frames_dir / frame_name))
        for i, r in enumerate(recs):
            x, y, w, h = map(int, r["bbox"])
            crop = img[y:y+h, x:x+w]
            if crop.size == 0:
                continue
            out_file = crops_dest / f"{frames_dir.name}_{frame_name[:-4]}_{i:03d}.jpg"
            cv2.imwrite(str(out_file), crop)
            metadata.append({"crop_path": str(out_file), "type": r["class"]})
    return metadata


##Cell 8: CLIP Embeddings & FAISS Index
Load the CLIP image encoder and processor. For every downloaded product thumbnail in catalog_images/, compute a 512-dimensional unit-normalized embedding. Build a FAISS IndexFlatIP index over these embeddings to enable fast nearest-neighbor search based on cosine similarity.

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval().to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compute_clip_embedding(image_path):
    img = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=img, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = clip_model.get_image_features(**inputs)
    emb = emb / emb.norm(p=2, dim=-1, keepdim=True)
    return emb.cpu().numpy().flatten().astype("float32")

catalog_ids = merged_df["id"].astype(str).tolist()
embs = np.zeros((len(catalog_ids), 512), dtype="float32")
for idx, pid in enumerate(catalog_ids):
    p = CATALOG_IMG_DIR / f"{pid}.jpg"
    if p.exists():
        embs[idx] = compute_clip_embedding(str(p))

faiss_index = faiss.IndexFlatIP(512)
faiss_index.add(embs)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/4.19k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/605M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/605M [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


preprocessor_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/592 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/862k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.22M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/389 [00:00<?, ?B/s]

##Cell 9: Match Crops to Catalog
Define a function that takes a crop image, computes its CLIP embedding, queries the FAISS index for the top match, and classifies the match as “exact,” “similar,” or “forced” based on similarity thresholds (e.g. 0.90 and 0.75). Always returns a catalog product ID, even if confidence is low.

In [9]:
EXACT_THRESH, SIMILAR_THRESH = 0.90, 0.75

def match_crop_to_catalog(crop_path: str):
    emb = compute_clip_embedding(crop_path).reshape(1, -1)
    D, I = faiss_index.search(emb, k=1)
    sim, idx = float(D[0][0]), int(I[0][0])
    pid = catalog_ids[idx]
    if sim >= EXACT_THRESH:
        mtype = "exact"
    elif sim >= SIMILAR_THRESH:
        mtype = "similar"
    else:
        mtype = "forced"
    return pid, mtype, sim


##Cell 10: Vibe Classification
Implement two methods for determining a video’s “vibes” from its captions or hashtags:

1.Rule‐based: Count predefined keywords mapping to each vibe category and pick the top matches.

2.Zero‐shot NLP: Use a pretrained BART-MNLI model to assign vibe labels without further training.



In [10]:
keyword_map = {
    "lace": "Coquette", "floral": "Cottagecore", "denim": "Streetcore",
    "vintage": "Y2K",   "boho": "Boho",             "glam": "Party Glam",
    "minimal": "Clean Girl"
}

def classify_vibe_rule(text):
    text = text.lower()
    scores = {v:0 for v in vibes_list}
    for kw, vibe in keyword_map.items():
        if kw in text:
            scores[vibe] += 1
    return [v for v, sc in sorted(scores.items(), key=lambda x:-x[1]) if sc>0][:3]

zero_shot_vibe = hf_pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_vibe_zero_shot(text):
    out = zero_shot_vibe(text, vibes_list, multi_label=True)
    return out["labels"][:3]


config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


##Cell 11: Build Final JSON for One Video
Combine everything into a single data structure for one video:

Vibes: Apply rule‐based or zero‐shot classification to its captions/hashtags (optional).

Products: For each crop in crops/{video_id}/, call the matching function to assign a catalog ID, match type, and confidence.

Return a dictionary with keys "video_id", "vibes", and "products".



In [11]:
def extract_dominant_color_dummy(_):
    return None

def build_video_json(video_id, use_zero_shot=True, captions_df=None):
    # Vibes
    if captions_df is not None and video_id in captions_df["video_id"].astype(str).tolist():
        txt = captions_df.loc[captions_df["video_id"].astype(str)==video_id, "caption_text"].iloc[0]
        tags = captions_df.loc[captions_df["video_id"].astype(str)==video_id, "hashtags"].iloc[0]
        text = f"{txt} {tags}"
        vibes = classify_vibe_zero_shot(text) if use_zero_shot else classify_vibe_rule(text)
    else:
        vibes = []

    # Products
    products = []
    for crop_file in sorted((CROPS_DIR / video_id).glob("*.jpg")):
        pid, mtype, conf = match_crop_to_catalog(str(crop_file))
        products.append({
            "type": crop_file.name.split("_")[1],
            "color": extract_dominant_color_dummy(str(crop_file)),
            "matched_product_id": pid,
            "match_type": mtype,
            "confidence": round(conf, 4)
        })

    return {"video_id": video_id, "vibes": vibes, "products": products}


##Cell 12: Full Pipeline Loop
Run the end-to-end pipeline for every .mp4 in your videos folder:

###Extract frames

###Run YOLOv8 detection

###Crop detected items

###Match crops to catalog

###Classify vibes

###Save the final JSON to outputs/{video_id}.json

###Print a confirmation for each video processed.

In [12]:
import pandas as pd

# Optional: load captions.csv if available
captions_csv = DATA_ROOT / "captions.csv"
captions_df = pd.read_csv(captions_csv) if captions_csv.exists() else None

def process_video(video_path, interval_sec=0.5):
    vid = video_path.stem
    extract_frames(video_path, FRAMES_DIR/vid, interval_sec)
    detect_objects(FRAMES_DIR/vid, DETECTIONS_DIR/f"{vid}.json")
    crop_items(FRAMES_DIR/vid, DETECTIONS_DIR/f"{vid}.json", CROPS_DIR/vid)
    result = build_video_json(vid, use_zero_shot=True, captions_df=captions_df)
    with open(OUTPUT_DIR/f"{vid}.json", "w") as f:
        json.dump(result, f, indent=2)
    print(f"Saved output for {vid}")

# Process all videos
for video_file in sorted(DEST_VIDEOS.glob("*.mp4")):
    process_video(video_file)



image 1/1 /content/data/frames/2025-05-22_08-25-12_UTC/frame_00000.jpg: 640x384 (no detections), 289.6ms
Speed: 30.7ms preprocess, 289.6ms inference, 32.0ms postprocess per image at shape (1, 3, 640, 384)

image 1/1 /content/data/frames/2025-05-22_08-25-12_UTC/frame_00001.jpg: 640x384 (no detections), 161.4ms
Speed: 3.0ms preprocess, 161.4ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 384)

image 1/1 /content/data/frames/2025-05-22_08-25-12_UTC/frame_00002.jpg: 640x384 (no detections), 144.3ms
Speed: 3.1ms preprocess, 144.3ms inference, 1.2ms postprocess per image at shape (1, 3, 640, 384)

image 1/1 /content/data/frames/2025-05-22_08-25-12_UTC/frame_00003.jpg: 640x384 (no detections), 147.5ms
Speed: 3.0ms preprocess, 147.5ms inference, 1.1ms postprocess per image at shape (1, 3, 640, 384)

image 1/1 /content/data/frames/2025-05-22_08-25-12_UTC/frame_00004.jpg: 640x384 (no detections), 149.8ms
Speed: 3.2ms preprocess, 149.8ms inference, 1.2ms postprocess per image at s

##Cell 13: Verify Outputs
List all generated JSON files in outputs/ and display the contents of the first one. This final sanity check confirms that your pipeline has produced correctly structured, per-video JSON results.

In [13]:
outputs = sorted(OUTPUT_DIR.glob("*.json"))
print("Generated JSON files:", [p.name for p in outputs])

if outputs:
    sample = json.loads(outputs[0].read_text())
    print("Sample JSON:", json.dumps(sample, indent=2))
else:
    print("No JSON files found in outputs.")


Generated JSON files: ['2025-05-22_08-25-12_UTC.json', '2025-05-27_13-46-16_UTC.json', '2025-05-28_13-40-09_UTC.json', '2025-05-28_13-42-32_UTC.json', '2025-05-31_14-01-37_UTC.json', '2025-06-02_11-31-19_UTC.json']
Sample JSON: {
  "video_id": "2025-05-22_08-25-12_UTC",
  "vibes": [],
  "products": [
    {
      "type": "08-25-12",
      "color": null,
      "matched_product_id": "97543",
      "match_type": "forced",
      "confidence": 0.6278
    },
    {
      "type": "08-25-12",
      "color": null,
      "matched_product_id": "97543",
      "match_type": "forced",
      "confidence": 0.6164
    },
    {
      "type": "08-25-12",
      "color": null,
      "matched_product_id": "97543",
      "match_type": "forced",
      "confidence": 0.6205
    }
  ]
}


In [14]:
from google.colab import files
files.download("/content/data/outputs/2025-06-02_11-31-19_UTC.json")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>