<a href="https://colab.research.google.com/github/tomasndlate/thesis/blob/main/ThesisResearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set-up

Install required dependencies:

In [1]:
!pip install -U num2words decord av accelerate git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

Collecting git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
  Cloning https://github.com/huggingface/transformers (to revision v4.49.0-SmolVLM-2) to /tmp/pip-req-build-weujp54k
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-weujp54k
  Running command git checkout -q 61e3ffd8148e68d879e3b2e1609fbb7d99621276
  Resolved https://github.com/huggingface/transformers to commit 61e3ffd8148e68d879e3b2e1609fbb7d99621276
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers
  Using cached transformers-5.0.0-py3-none-any.whl.metadata (37 kB)


Log in to the Hugging Face Hub (using the HF token saved in Colab secrets):

In [2]:
from huggingface_hub import login
from google.colab import userdata

token = userdata.get('HF_TOKEN')

login(token)

Load the SmolVLM2-256M model and its processor:

In [3]:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"

processor = AutoProcessor.from_pretrained(model_id)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    device_map="auto"
)

print("Model loaded successfully")

Model loaded successfully


Define the paths to the test image and video:

In [7]:
image_path = "/content/drive/MyDrive/ThesisResearch/dmd/test.png"
video_path = "/content/drive/MyDrive/ThesisResearch/dmd/gA/3/s1/gA_3_s1_2019-03-08T10;27;38+01;00_ir_body.mp4"
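
Both paths point into Google Drive, so they only resolve once the drive is mounted (in Colab this is typically done with `google.colab.drive.mount("/content/drive")`). A minimal sketch of a sanity check that could run before any inference; `find_missing_paths` is a hypothetical helper and is not part of the original notebook:

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Hypothetical usage with the variables defined above:
# missing = find_missing_paths([image_path, video_path])
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```

Failing early here gives a clearer error than one raised deep inside the processor when it tries to open the file.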

Custom prediction function:

In [9]:
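from typing import Literal

import torch


def predict_with_model(
    media_type: Literal["video", "image"],
    media_path: str,
    prompt: str,
    model,
    processor,
    max_new_tokens: int = 150,
):
    """Run one single-turn chat request and return the decoded text.

    The returned string contains the echoed prompt followed by the reply.
    """
    # Single user turn: one media item followed by the text prompt.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": media_type, "path": media_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Tokenize the chat and move the tensors to the model's device
    # (instead of hard-coding "cuda"); floating-point inputs are cast
    # to bfloat16 to match the model weights.
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device, dtype=torch.bfloat16)

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    response = processor.batch_decode(output_ids, skip_special_tokens=True)
    return response[0]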

# Experiment 1: Zero-Shot Prompting (Gaze Direction)

### Image prediction

In [10]:
experiment_prompt = "Where is the person looking to in this image?"
predicted_response = predict_with_model("image", image_path, experiment_prompt, model, processor)
print(predicted_response)

User:



Where is the person looking to in this image?
Assistant: The person is looking to the left side of the image.
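
The decoded output above contains the echoed prompt as well as the answer, because the generated sequence starts with the input tokens. A minimal sketch of how the reply alone could be extracted, assuming the chat template always marks the model turn with the literal text `Assistant:` as it does in this output; `extract_assistant_reply` is a hypothetical helper:

```python
def extract_assistant_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last `marker`, or the whole string if absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


example = (
    "User:\n\nWhere is the person looking to in this image?\n"
    "Assistant: The person is looking to the left side of the image."
)
print(extract_assistant_reply(example))
# The person is looking to the left side of the image.
```

String splitting is fragile if the prompt itself contains the marker, which is why the helper takes the text after the last occurrence. Slicing the generated token IDs by the input length before decoding is another common approach.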


### Video prediction

# Experiment 2: One-Shot Prompting (Distraction Detection)

### Image prediction

In [12]:
example_image = image_path
inference_image = image_path

experiment_prompt = f"""
  -User: You are a driver monitoring system that is responsible for assuring
   the driver is driving safely and alert when they are distracted. What is
   the state of this driver? {example_image}

  -Assistant: This driver is distracted because he is having a phonecall while driving

  -User: And how about this driver? {inference_image}
"""

predicted_response = predict_with_model("image", image_path, experiment_prompt, model, processor)
print(predicted_response)

User:




  -User: You are a driver monitoring system that is responsible for assuring 
   the driver is driving safely and alert when they are distracted. What is
   the state of this driver? /content/drive/MyDrive/ThesisResearch/dmd/test.png

  -Assistant: This driver is distracted because he is having a phonecall while driving

  -User: And how about this driver? /content/drive/MyDrive/ThesisResearch/dmd/test.png

Assistant: This driver is driving and has a car in the background
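
In the prompt above, the example and inference images appear only as file paths inside the text, so the model receives one image plus two path strings rather than a demonstration image followed by a query image. Both variables also point to the same file. One possible alternative is to express the one-shot example as a multi-turn conversation with an image attached to each user turn. The sketch below only builds the message list; it assumes the processor's chat template accepts assistant turns and more than one image per conversation, which was not verified in this notebook. `build_one_shot_messages` and its parameters are hypothetical names:

```python
def build_one_shot_messages(
    instruction: str,
    example_path: str,
    example_answer: str,
    inference_path: str,
    follow_up: str,
):
    """Build a three-turn chat: demonstration question, its answer, new query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "path": example_path},
                {"type": "text", "text": instruction},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example_answer}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "path": inference_path},
                {"type": "text", "text": follow_up},
            ],
        },
    ]
```

The resulting list could be passed to `processor.apply_chat_template` in place of the single-turn `messages` built inside `predict_with_model`. A meaningful test would also need an example image that differs from the inference image.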


### Video prediction

# Experiment 3: Structured Output (Code-Format)

### Image prediction

One-shot prompt with a formatted output:

In [14]:
example_image = image_path
inference_image = image_path

experiment_prompt = f"""
  -User: You are a driver monitoring system that is responsible for
   assuring the driver is driving safely and alert when they are distracted.
   You need to communicate with the HMI to alert the driver, please provide
  the following variables with True or False: Distracted, Talking, Using
phone. What is the state of this driver? {example_image}

  -Assistant: Distracted = True, Talking = No, Using phone=No

  -User: And how about this driver? {inference_image}
"""

predicted_response = predict_with_model("image", image_path, experiment_prompt, model, processor)
print(predicted_response)

User:




  -User: You are a driver monitoring system that is responsible for
   assuring the driver is driving safely and alert when they are distracted.
   You need to communicate with the HMI to alert the driver, please provide
  the following variables with True or False: Distracted, Talking, Using
phone. What is the state of this driver? /content/drive/MyDrive/ThesisResearch/dmd/test.png

  -Assistant: Distracted = True, Talking = No, Using phone=No

  -User: And how about this driver? /content/drive/MyDrive/ThesisResearch/dmd/test.png

Assistant: Distracted = No, Talking = Yes
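
The reply above uses `Yes`/`No` instead of the requested `True`/`False` (mirroring the example answer in the prompt) and omits `Using phone`. A minimal sketch of a tolerant parser that an HMI integration might apply to the reply, assuming the `Name = Value` layout seen here; `parse_driver_state` is a hypothetical helper and should be given only the assistant's reply, not the echoed prompt:

```python
import re

TRUTHY = {"true", "yes"}
FALSY = {"false", "no"}


def parse_driver_state(reply: str) -> dict:
    """Parse 'Name = Value' pairs into booleans (None if the value is unknown)."""
    state = {}
    for name, value in re.findall(r"([A-Za-z][A-Za-z ]*?)\s*=\s*(\w+)", reply):
        value = value.lower()
        if value in TRUTHY:
            state[name.strip()] = True
        elif value in FALSY:
            state[name.strip()] = False
        else:
            state[name.strip()] = None
    return state


print(parse_driver_state("Distracted = No, Talking = Yes"))
# {'Distracted': False, 'Talking': True}
```

Keys missing from the result, such as `Using phone` in this case, can then be detected explicitly instead of silently treated as `False`.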


### Video prediction

# Evaluation and Challenges