![Cosmos-Reason1-7B](cosmos-reason1_banner.png)

**Cosmos-Reason1** is a suite of models, tools, and benchmarks designed to enable multimodal large language models (LLMs) to reason with physical common sense and generate grounded responses. This notebook helps you set up the environment and demonstrates two example inference use cases.

The following steps are based on [GitHub: Cosmos-Reason1-7B](https://github.com/nvidia-cosmos/cosmos-reason1/tree/main).
- Tested commits:
    - GitHub commit ID: 98ebb68dfe2aaa1af2c78308e774b65f6b0ecd3d
    - Hugging Face commit ID: 8fe96c1fa10db9e666b6fa6a87fea57dd9635649

### Create the requirements.txt
---
The cell below writes a `requirements.txt` file listing the Python packages needed to run the Cosmos-Reason1 models and examples.

In [None]:
%%writefile requirements.txt
accelerate
qwen-vl-utils
rich
torch
torchcodec
torchvision
transformers>=4.51.3
vllm

ipykernel
ipywidgets
huggingface_hub

### Setup Environment and Dependencies
---
Execute the following commands in a terminal. To open a terminal: Launcher tab -> Other -> Terminal
 
```bash
# Download the sample video file
wget https://github.com/nvidia-cosmos/cosmos-reason1/raw/98ebb68dfe2aaa1af2c78308e774b65f6b0ecd3d/assets/sample.mp4
 
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
 
# Log in to your Hugging Face account to download checkpoints later
# Get your access token here: https://huggingface.co/settings/tokens
uv tool install -U "huggingface_hub[cli]"
hf auth login
 
# Create a python virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
# Create a Python kernel for the notebook
python -m ipykernel install --user --name=reason1 --display-name "Python (.venv) Reason1"
 
# Reactivate the Python virtual environment
deactivate
source .venv/bin/activate
```

### Switch to the Custom Python Kernel
---
1. Go back to the notebook: *cosmos-reason1.ipynb*
2. Click **Python 3 (ipykernel)** in the upper-right corner.
3. Pick **Python (.venv) Reason1** in the *Start python Kernel* section, then click the Select button. (If you don't see the option, try restarting the notebook.)
4. The kernel button in the upper-right corner should now read *Python (.venv) Reason1*.
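
To confirm that the notebook is really running on the new kernel, a quick check along the following lines may help. This is a minimal sketch: `check_environment` is a hypothetical helper, not part of Cosmos-Reason1, and it assumes the virtual environment lives in a directory named `.venv`, as created by `uv venv` above.

```python
import sys
from importlib import metadata
from pathlib import Path


def check_environment(packages, prefix=sys.prefix):
    """Report whether the interpreter lives in `.venv` and which packages are installed.

    Hypothetical helper for illustration; not part of Cosmos-Reason1.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Not installed in this environment
    return {"in_venv": Path(prefix).name == ".venv", "versions": versions}


report = check_environment(["torch", "transformers", "vllm", "qwen-vl-utils"])
print("Running inside .venv:", report["in_venv"])
for name, version in report["versions"].items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If `Running inside .venv` prints `False` or a package is reported as not installed, the notebook is most likely still attached to the default kernel.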

### Download Checkpoints from Hugging Face
---
Run the cell below to download the checkpoint. When all files have been downloaded successfully, the cell outputs the local checkpoint path, for example:
```bash
'/home/ubuntu/nvidia/Cosmos-Reason1-7B'
```

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="nvidia/Cosmos-Reason1-7B",
    local_dir="nvidia/Cosmos-Reason1-7B",
    revision="8fe96c1fa10db9e666b6fa6a87fea57dd9635649"
)
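
As an optional sanity check after the download, the sketch below summarizes the contents of the local checkpoint directory. `summarize_checkpoint` is a hypothetical helper, and the file names it looks for (`config.json` and `*.safetensors` weight shards) are assumptions about the repository layout rather than something this notebook guarantees.

```python
from pathlib import Path


def summarize_checkpoint(local_dir):
    """Summarize a downloaded checkpoint directory.

    Hypothetical helper; the expected file names are assumptions.
    """
    root = Path(local_dir)
    # Only look for weight files if the directory actually exists
    weights = sorted(root.glob("*.safetensors")) if root.is_dir() else []
    return {
        "exists": root.is_dir(),
        "has_config": (root / "config.json").is_file(),
        "num_weight_files": len(weights),
        "size_gb": sum(f.stat().st_size for f in weights) / 1e9,
    }


print(summarize_checkpoint("nvidia/Cosmos-Reason1-7B"))
```

A result with `exists` set to `False` or zero weight files suggests the download did not complete or was written to a different directory.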

### Use Case #1: Transformers
---
This Python script performs multimodal inference using the Cosmos-Reason1-7B model. It processes a combination of a text prompt, image files, and video files, then generates and prints a textual response based on the visual content.

In [None]:
import qwen_vl_utils
import transformers
from rich import print

# --- Parameters to configure ---
prompt = "Please describe the video."
images = []  # TODO: Optional, replace with your actual image path(s)
videos = ["sample.mp4"]  # TODO: Optional, replace with your video path(s) if any

# You can leave these as default or change them
model_name = "nvidia/Cosmos-Reason1-7B"
system_prompt = "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
fps = 1  # Downsample video frame rate
max_pixels = 81920  # Downsample media max pixels
# --- End of configuration ---

# 1. Process the media and text inputs
user_content = []
for image in images or []:
    user_content.append(
        {"type": "image", "image": image, "max_pixels": max_pixels}
    )
for video in videos or []:
    user_content.append(
        {"type": "video", "video": video, "fps": fps, "max_pixels": max_pixels}
    )
user_content.append({"type": "text", "text": prompt})

# 2. Format the messages for the model
messages = []
if system_prompt:
    messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": user_content})
print("Messages:", messages)

# 3. Load the model and processor
model = transformers.Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor: transformers.Qwen2_5_VLProcessor = (
    transformers.AutoProcessor.from_pretrained(model_name, use_fast=True)
)

# 4. Set up generation configuration
generation_config = transformers.GenerationConfig(
    do_sample=True,
    max_new_tokens=4096,
    repetition_penalty=1.05,
    temperature=0.6,
    top_p=0.95,
)

# 5. Process the messages into tensors
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = qwen_vl_utils.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# 6. Run inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

# 7. Print the final output
print(f"\n\n{output_text[0]}")
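
The system prompt above asks the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. If only the answer is needed downstream, the response can be split with a small parser. The following is a minimal sketch: `split_reasoning` is a hypothetical helper, and the example string is made up for illustration, not real model output.

```python
import re


def split_reasoning(response):
    """Split a response into (reasoning, answer).

    Hypothetical helper; a part is None when its tags are missing,
    for example if generation stopped before the closing tag.
    """
    def extract(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return extract("think"), extract("answer")


# Made-up example string in the requested format
example = "<think>\nThe road is clear.\n</think>\n\n<answer>\nYes\n</answer>"
reasoning, answer = split_reasoning(example)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook, the same helper could be applied to `output_text[0]`. Because sampling is enabled, the model may occasionally deviate from the format, which is why missing tags yield `None` instead of raising an error.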

### Use Case #2: vLLM
---
This script uses the vLLM library for high-performance multimodal inference with the Cosmos-Reason1-7B model. It processes a video file (sample.mp4) along with a text question, then generates and prints a textual answer based on the video's content.

It takes a few seconds to start printing output. You should see `</answer>` at the end of the log.

NOTE: If you encounter *RuntimeError: CUDA error*, try upgrading the GPU driver to version 570.172.08 (CUDA 12.8).

In [None]:
from rich import print
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

# --- Parameters to configure ---
VIDEO_PATH = "sample.mp4"  # TODO: Optional, replace with your video path
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"
# --- End of configuration ---


llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

video_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": ("Is it safe to turn right?")},
            {
                "type": "video",
                "video": VIDEO_PATH,
                "fps": 4,
            },
        ],
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)
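
The script above only demonstrates video input. A message list that also carries images could be assembled as sketched below. `build_messages` is a hypothetical helper and `sample.jpg` is a placeholder path. The sketch assumes the image entry schema from Use Case #1 (`{"type": "image", "image": path}`) is accepted here as well.

```python
def build_messages(question, images=(), videos=(), fps=4, system_prompt=None):
    """Build a chat message list with optional image and video entries.

    Hypothetical helper; mirrors the message layout used in this notebook.
    """
    content = [{"type": "text", "text": question}]
    content += [{"type": "image", "image": path} for path in images]
    content += [{"type": "video", "video": path, "fps": fps} for path in videos]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": content})
    return messages


image_messages = build_messages(
    "What is happening in this image?",
    images=["sample.jpg"],  # Placeholder path; replace with a real image
)
print(image_messages)
```

Assigning the result to `messages` in place of `video_messages` would route it through the same processing steps. Note that `limit_mm_per_prompt` in the script caps each prompt at 10 images and 10 videos.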