# Lecture 12: Visual and Large Multimodal Models

This notebook shows how to use visual models and Large Multimodal Models (LMMs) for inference tasks and model fine-tuning.

We primarily use three models: 

1. Stable Diffusion (separate images; see ComfyUI-tutorial.md and sd-WebUI-tutorial.md)
    A latent text-to-image diffusion model. (https://github.com/CompVis/stable-diffusion)
    1.1 ComfyUI (for inference)
    1.2 WebUI (for training)

2. Segment anything model (SAM 2):  
    SAM2 is a foundation model towards solving promptable visual segmentation in images and videos. 
    (https://huggingface.co/facebook/sam2-hiera-large)
    
3. Qwen2-VL:  
    Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. It can answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
    The model is available on HuggingFace: (https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) 



Note: The weights of all models used in this notebook come from HuggingFace and have already been downloaded to **`/ssdshare/share/lab5`**, so you can use them directly.
You can also download the model weights yourself and store them locally.

## Part 1 Image segmentation with SAM-2

### 1. Preparing the input/output/configuration directories

In [1]:
!mkdir -p sam2
# -sfn replaces an existing link, so this cell can be re-run safely
!ln -sfn /ssdshare/share/lab5/sam2/checkpoints/ sam2/checkpoints

import os
data_path = "/ssdshare/share/lab5/sam2data"
image_path = os.path.join(data_path, "images")
video_path = os.path.join(data_path, "videos")
sam2_path = "sam2"

### 2. Image Segmentation with SAM-2
SAM-2 is a foundation model towards solving promptable visual segmentation in images and videos. In this task, we will use SAM-2 to segment objects in an image and a video.

In [2]:
# configure the GPU (or CPU), here we use the GPU version

import os
import torch


if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"using device: {device}")

if device.type == "cuda":
    # use bfloat16 for the entire notebook
    torch.autocast("cuda", dtype=torch.bfloat16).__enter__()
    # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
    if torch.cuda.get_device_properties(0).major >= 8:
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True



In [3]:
# import SAM and other image processing dependencies
from sam2.build_sam import build_sam2, build_sam2_video_predictor
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from sam2.sam2_image_predictor import SAM2ImagePredictor
import matplotlib.pyplot as plt



In [4]:
import numpy as np
np.random.seed(3)  # optional, for reproducibility

In [5]:
# helper functions to display the generated masks

def show_mask(mask, ax, random_color=False, borders=True, obj_id=None):
    # Display one of the generated masks
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    elif obj_id is None:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    else:
        cmap = plt.get_cmap("tab10")
        cmap_idx = 0 if obj_id is None else obj_id
        color = np.array([*cmap(cmap_idx)[:3], 0.6])
    h, w = mask.shape[-2:]
    mask = mask.astype(np.uint8)
    mask_image =  mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    if borders:
        import cv2
        contours, _ = cv2.findContours(mask,cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) 
        # Try to smooth contours
        contours = [cv2.approxPolyDP(contour, epsilon=0.01, closed=True) for contour in contours]
        mask_image = cv2.drawContours(mask_image, contours, -1, (1, 1, 1, 0.5), thickness=2) 
    ax.imshow(mask_image)

def show_points(coords, labels, ax, marker_size=375):
    # condition points
    pos_points = coords[labels==1]
    neg_points = coords[labels==0]
    ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)
    ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)   

def show_box(box, ax):
    # condition box
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))    

def show_masks(image, masks, scores, point_coords=None, box_coords=None, input_labels=None, borders=True):
    # SAM-2 can return several candidate masks;
    # iterate over them and display each one with its score
    for i, (mask, score) in enumerate(zip(masks, scores)):
        plt.figure(figsize=(10, 10))
        plt.imshow(image)
        show_mask(mask, plt.gca(), borders=borders)
        if point_coords is not None:
            assert input_labels is not None
            show_points(point_coords, input_labels, plt.gca())
        if box_coords is not None:
            # boxes
            show_box(box_coords, plt.gca())
        if len(scores) > 1:
            plt.title(f"Mask {i+1}, Score: {score:.3f}", fontsize=18)
        plt.axis('off')
        plt.show()
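
To see what `show_mask` computes before it calls `imshow`, here is a minimal sketch on a toy 2×3 mask (the mask values are made up for illustration). Broadcasting an `(h, w, 1)` mask against a `(1, 1, 4)` RGBA color gives an `(h, w, 4)` overlay that carries the color where the mask is 1 and is fully transparent elsewhere.

```python
import numpy as np

# toy binary mask: 1 = object, 0 = background (illustrative values only)
mask = np.array([[0, 1, 1],
                 [0, 0, 1]], dtype=np.uint8)
# same default RGBA color that show_mask uses when obj_id is None
color = np.array([30/255, 144/255, 255/255, 0.6])

h, w = mask.shape[-2:]
# (h, w, 1) * (1, 1, 4) broadcasts to (h, w, 4)
overlay = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)

print(overlay.shape)   # shape of the RGBA overlay
print(overlay[0, 0])   # a background pixel
print(overlay[0, 1])   # an object pixel
```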

In [6]:
# here is how to read an input image
from PIL import Image
truck_image = Image.open(os.path.join(image_path, "truck.jpg"))
truck_image

In [7]:
# load configurations of SAM-2

import hydra  # hydra is a configuration management library
hydra.core.global_hydra.GlobalHydra.instance().clear()

hydra.initialize(config_path=sam2_path)

# build sam2 model
sam2_checkpoint = "sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

sam2 = build_sam2(model_cfg,
                    sam2_checkpoint, 
                    device=device, 
                    apply_postprocessing=False # When set to True, applies post-processing steps to refine masks:
                                              # - Removes small disconnected regions
                                              # - Fills small holes
                                              # - Applies boundary smoothing
                                              # These steps improve mask quality but add computational overhead
                    )
# create automatic mask generator and image predictor
mask_generator = SAM2AutomaticMaskGenerator(
    model=sam2, # the model built above
    points_per_side=64, # Grid size for sampling points
    points_per_batch=128, # Number of points to process at once
    pred_iou_thresh=0.7, # IoU threshold for prediction confidence
    stability_score_thresh=0.92, # Minimum stability score for a mask to be considered
    stability_score_offset=0.7, # Offset for stability score threshold
    crop_n_layers=1, # Number of layers to crop the image
    box_nms_thresh=0.7, # IoU threshold for non-maximum suppression
    crop_n_points_downscale_factor=2, # Downscale factor for cropping points
    min_mask_region_area=25.0,  # Minimum area for a mask to be considered
    use_m2m=True, # Use M2M (Mask-to-Mask) for mask refinement - a technique that refines initial mask predictions by using them as input for subsequent iterations, improving segmentation quality
)
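
To get a feel for `points_per_side` and `points_per_batch`, the sketch below lays out a regular grid of point prompts and counts the batches needed for one full-image pass. It assumes the prompts sit at the centers of a uniform grid in normalized coordinates; `build_point_grid_sketch` is a hypothetical helper written for this example, not part of the `sam2` API.

```python
import numpy as np

def build_point_grid_sketch(n_per_side: int) -> np.ndarray:
    """Return an (n*n, 2) array of (x, y) points in [0, 1], one per grid cell center."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

points_per_side = 64    # same value as in the cell above
points_per_batch = 128  # same value as in the cell above

grid = build_point_grid_sketch(points_per_side)
n_batches = int(np.ceil(len(grid) / points_per_batch))
print(grid.shape, n_batches)
```

With `crop_n_layers=1` the generator also runs on image crops, so the real number of prompts is higher than this single-pass count.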


In [8]:
# generate masks
# the first run loads the model weights, which takes a while
image = np.array(truck_image.convert("RGB"))
masks = mask_generator.generate(image)
plt.figure(figsize=(20, 20))
plt.imshow(image)
for mask in masks:
    show_mask(mask['segmentation'], plt.gca(), random_color=True, borders=False)
plt.axis('off')
plt.show() 
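
`mask_generator.generate` returns a list of dicts, one per mask. Assuming each record has a boolean `segmentation` array and an `area` entry, a common next step is to sort the records by area so that large regions are drawn first and small ones stay visible on top. The records below are synthetic stand-ins built by a hypothetical `fake_record` helper.

```python
import numpy as np

def fake_record(h, w, y0, y1, x0, x1):
    """Build a synthetic mask record with a rectangular segmentation."""
    seg = np.zeros((h, w), dtype=bool)
    seg[y0:y1, x0:x1] = True
    return {"segmentation": seg, "area": int(seg.sum())}

records = [
    fake_record(8, 8, 0, 2, 0, 2),
    fake_record(8, 8, 0, 6, 0, 6),
    fake_record(8, 8, 4, 7, 4, 8),
]

# largest mask first
records_sorted = sorted(records, key=lambda r: r["area"], reverse=True)
areas = [r["area"] for r in records_sorted]
print(areas)
```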

In [9]:
# check condition point
# 1 in the label means that the point is inside the object
input_point = np.array([[500, 375]]) # locate the point on the object
input_label = np.array([1]) # the point is inside the object
plt.figure(figsize=(10, 10))
plt.imshow(image)
show_points(input_point, input_label, plt.gca())
plt.axis('on')
plt.show()  

In [10]:
# step 1: predict masks
predictor = SAM2ImagePredictor(sam2)
predictor.set_image(image)
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # output all the masks
)
# step 2: sort by probability score
sorted_ind = np.argsort(scores)[::-1] # [::-1]reverses the order
masks = masks[sorted_ind]
scores = scores[sorted_ind]
logits = logits[sorted_ind]

# show all the masks
show_masks(image, masks, scores, point_coords=input_point, input_labels=input_label, borders=True)
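
The sorting step above relies on `np.argsort` returning indices in ascending order, which `[::-1]` then reverses. A small sketch with made-up scores shows how the same index array reorders every per-mask array consistently:

```python
import numpy as np

# made-up scores and labels standing in for the predictor outputs
scores = np.array([0.42, 0.97, 0.80])
masks = np.array(["mask_a", "mask_b", "mask_c"])

order = np.argsort(scores)[::-1]  # indices from highest to lowest score
print(order)
print(masks[order])
print(scores[order])
```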

In [11]:
# we can use more condition points to specify a more concrete area
input_point = np.array([[500, 375], [1125, 625]])
input_label = np.array([1, 1])

# feed the logits of the best mask from the previous prediction back in as a prior
mask_input = logits[np.argmax(scores), :, :]  # Choose the model's best mask
                                              # :, : means to keep all dimensions in the height and width of the mask
masks, scores, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    mask_input=mask_input[None, :, :],  # first dimension is the batch size, here we only have one
    multimask_output=False,
)
show_masks(image, masks, scores, point_coords=input_point, input_labels=input_label)

In [12]:
# red star is negative point, which means the area should not be included in the mask
input_point = np.array([[500, 375], [1125, 625]])
input_label = np.array([1, 0])   ## 0 is a negative point

mask_input = logits[np.argmax(scores), :, :]  # Choose the model's best mask
masks, scores, _ = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    mask_input=mask_input[None, :, :],
    multimask_output=False,
)
show_masks(image, masks, scores, point_coords=input_point, input_labels=input_label)

In [13]:
# we can use a box to specify an area
input_box = np.array([425, 600, 700, 875]) # a box is a tuple of (x1, y1, x2, y2)
masks, scores, _ = predictor.predict(
    point_coords=None,
    point_labels=None,
    box=input_box[None, :],  # the box
    multimask_output=False,
)
show_masks(image, masks, scores, box_coords=input_box)

In [14]:
# we can combine box and point to specify an area
input_box = np.array([425, 600, 700, 875])
input_point = np.array([[575, 750]])
input_label = np.array([0])
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    box=input_box,
    multimask_output=False,
)
show_masks(image, masks, scores, box_coords=input_box, point_coords=input_point, input_labels=input_label)

Choose *one* of the following two tasks.

In [None]:
#### your task ####
# segment all the groceries in the image '/ssdshare/share/lab5/sam2data/images/groceries.jpg'
# the mask should contain all the groceries (in one mask), including the bags and the visible items, but nothing else
# display the segmentation results as masks
# you can either use points or a box to help the segmentation


In [None]:
#### your task ####
# segment out the license plate in the image '/ssdshare/share/lab5/sam2data/images/cars.jpg'


In [15]:
# clean up

del sam2
del predictor
del mask_generator
del logits
del mask_input
import gc
for i in range(3):
    # garbage collection first, so the deleted tensors are actually freed
    gc.collect()
    # then clear the CUDA memory cache
    torch.cuda.empty_cache()


### 3. Using SAM2 for Video Segmentation

In [16]:
# create sam2 video predictor
predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)

#### Step 0: Extracting all frames of video into images (Already done for you)

A video is a sequence of images. 

First, we will extract all frames from the video.

For example, the `./videos` folder contains a video named bedroom.mp4. We can extract its frames using the following command:
```
mkdir bedroom
ffmpeg -i ./videos/bedroom.mp4 -q:v 2 -start_number 0 ./bedroom/'%05d.jpg'
```
where `-q:v 2` requests high-quality JPEG frames and `-start_number 0` makes ffmpeg number the output files starting from `00000.jpg`.

If you want to try this task with your own video, you can upload it to the ./videos folder and extract the frames using the above command.

Note: before running the above command, make sure the ffmpeg package is installed. If it is not, you can install it using the following command:
```
apt install ffmpeg
```
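
The same extraction can be scripted from Python. The sketch below only assembles the command shown above and checks that `ffmpeg` is on the `PATH` before running it; `build_ffmpeg_cmd` and `extract_frames` are hypothetical helper names introduced for this example.

```python
import shutil
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video_file: str, out_dir: str) -> list:
    """Assemble the ffmpeg call that writes frames as 00000.jpg, 00001.jpg, ..."""
    return [
        "ffmpeg", "-i", video_file,
        "-q:v", "2",            # high-quality JPEG output
        "-start_number", "0",   # first frame is 00000.jpg
        str(Path(out_dir) / "%05d.jpg"),
    ]

def extract_frames(video_file: str, out_dir: str) -> None:
    """Create out_dir and run ffmpeg; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH; install it first")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video_file, out_dir), check=True)

# only print the command here; call extract_frames(...) to actually run it
cmd = build_ffmpeg_cmd("./videos/bedroom.mp4", "./bedroom")
print(" ".join(cmd))
```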

Then we can examine the extracted images.

In [17]:
# `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
video_dir = os.path.join(video_path, "bedroom")

# scan all the JPEG frame names in this directory
frame_names = [
    p for p in os.listdir(video_dir)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))

# take a look at the first video frame
frame_idx = 0
plt.figure(figsize=(9, 6))
plt.title(f"frame {frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[frame_idx])))
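
The `int(...)` sort key above matters when frame names are not zero-padded. A short sketch with made-up filenames shows how plain string sorting would put the frames in the wrong order:

```python
import os

# made-up frame names without zero padding
names = ["10.jpg", "2.jpg", "1.jpg", "100.jpg"]

lexicographic = sorted(names)
numeric = sorted(names, key=lambda p: int(os.path.splitext(p)[0]))

print(lexicographic)  # string order: "10" sorts before "2"
print(numeric)        # frame order
```

With zero-padded names such as `00000.jpg` both orders agree, so the key is a safeguard rather than a requirement for the provided data.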

#### Step 1: Segment an object in a single frame

We show an example of segmenting the boy in the video. First, let's choose frame 0 and add a first click on it.

In [18]:
# SAM 2 requires stateful inference for interactive video segmentation, so we need to initialize an **inference state** on this video.
# During initialization, it loads all the JPEG frames in `video_path` and stores their pixels in `inference_state` (as shown in the progress bar below).
inference_state = predictor.init_state(video_path=video_dir)

# Choose a frame index to interact with
ann_frame_idx = 0  # frame index
ann_obj_id = 1  # give a unique id to each object we interact with (it can be any integers)

# Let's add a positive click at (x, y) = (210, 350) to get started
points = np.array([[210, 350]], dtype=np.float32)  # the boy
labels = np.array([1], np.int32)  # positive click
_, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_id=ann_obj_id,
    points=points,
    labels=labels,
)

# show the results on the current (interacted) frame
plt.figure(figsize=(9, 6))
plt.title(f"frame {ann_frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
show_points(points, labels, plt.gca())
show_mask((out_mask_logits[0] > 0.0).cpu().numpy(), plt.gca(), obj_id=out_obj_ids[0], borders=False)
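
The cell above turns logits into a binary mask with `> 0.0`. Assuming the logits are pre-sigmoid scores, which is how this notebook treats them, that threshold is the same as keeping pixels whose foreground probability exceeds 0.5. A sketch with made-up logits:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# made-up mask logits for a 2x3 image
logits = np.array([[-3.2, -0.1, 0.0],
                   [ 0.4,  2.5, 7.0]])

mask_from_logits = logits > 0.0
mask_from_probs = sigmoid(logits) > 0.5

print(mask_from_logits)
print(np.array_equal(mask_from_logits, mask_from_probs))
```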

This is not what we want: the mask should cover the entire boy. We can add a second click to refine the prediction.

In [19]:
ann_frame_idx = 0  # the frame index we interact with
ann_obj_id = 1  # give a unique id to each object we interact with (it can be any integers)

# Let's add a 2nd positive click at (x, y) = (250, 220) to refine the mask
# sending all clicks (and their labels) to `add_new_points_or_box`
points = np.array([[210, 350], [250, 220]], dtype=np.float32)  # a second click on the boy
labels = np.array([1, 1], np.int32)  # both are positive clicks
_, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_id=ann_obj_id,
    points=points,
    labels=labels,
)

# show the results on the current (interacted) frame
plt.figure(figsize=(9, 6))
plt.title(f"frame {ann_frame_idx}")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
show_points(points, labels, plt.gca())
show_mask((out_mask_logits[0] > 0.0).cpu().numpy(), plt.gca(), obj_id=out_obj_ids[0], borders=False)

With this 2nd refinement click, now we get a segmentation mask of the entire child on frame 0.  Now let's expand the result into the entire video.

#### Step 2: Propagate the prompts to get the masklet across the video

To get the masklet throughout the entire video, we propagate the prompts using the `propagate_in_video` API.

In [20]:
# run propagation throughout the video and collect the results in a dict
video_segments = {}  # video_segments contains the per-frame segmentation results
# iterate over the frames of the video
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, out_obj_id in enumerate(out_obj_ids)
    }

# render the segmentation results every few frames
vis_frame_stride = 30
plt.close("all")
for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
    plt.figure(figsize=(6, 4))
    plt.title(f"frame {out_frame_idx}")
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_obj_id, out_mask in video_segments[out_frame_idx].items():
        show_mask(out_mask, plt.gca(), obj_id=out_obj_id, borders=False)
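
`video_segments` maps each frame index to a dict of `{object id: boolean mask}`. One simple use of that structure is to track the mask area per frame and flag frames where the object was lost. The sketch below uses tiny synthetic masks in place of real predictions.

```python
import numpy as np

# synthetic stand-in with the same layout as video_segments
video_segments_demo = {
    0: {1: np.array([[True, True], [False, False]])},
    1: {1: np.array([[True, True], [True, False]])},
    2: {1: np.array([[False, False], [False, False]])},
}

obj_id = 1
area_per_frame = {
    frame: int(segments[obj_id].sum())
    for frame, segments in video_segments_demo.items()
}
# frames where the mask is empty, i.e. the object was not found
lost_frames = [frame for frame, area in area_per_frame.items() if area == 0]

print(area_per_frame)
print(lost_frames)
```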

#### Step 3: Add new prompts to further refine the masklet

It appears that in the output masklet above, there are some small imperfections in boundary details on frame 150.

We can tell the model not to include the imperfections by adding a new prompt on frame 150.

In [21]:
ann_frame_idx = 150  # further refine some details on this frame
ann_obj_id = 1  # give a unique id to the object we interact with (it can be any integer)

# show the segment before further refinement
plt.figure(figsize=(9, 6))
plt.title(f"frame {ann_frame_idx} -- before refinement")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
show_mask(video_segments[ann_frame_idx][ann_obj_id], plt.gca(), obj_id=ann_obj_id, borders=False)

# Let's add a negative click on this frame at (x, y) = (82, 410) to refine the segment
points = np.array([[82, 410]], dtype=np.float32)
# for labels, `1` means positive click and `0` means negative click
labels = np.array([0], np.int32)
_, _, out_mask_logits = predictor.add_new_points_or_box(
    inference_state=inference_state,
    frame_idx=ann_frame_idx,
    obj_id=ann_obj_id,
    points=points,
    labels=labels,
)

# show the segment after the further refinement
plt.figure(figsize=(9, 6))
plt.title(f"frame {ann_frame_idx} -- after refinement")
plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx])))
show_points(points, labels, plt.gca())
show_mask((out_mask_logits > 0.0).cpu().numpy(), plt.gca(), obj_id=ann_obj_id, borders=False)

#### Step 4: Propagate the prompts (again) to get the masklet across the video

Let's get an updated masklet for the entire video. Here we call `propagate_in_video` again to propagate all the prompts after adding the new refinement click above.

In [22]:
# run propagation throughout the video and collect the results in a dict
video_segments = {}  # video_segments contains the per-frame segmentation results
for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, out_obj_id in enumerate(out_obj_ids)
    }

# render the segmentation results every few frames
vis_frame_stride = 30
plt.close("all")
for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
    plt.figure(figsize=(6, 4))
    plt.title(f"frame {out_frame_idx}")
    plt.imshow(Image.open(os.path.join(video_dir, frame_names[out_frame_idx])))
    for out_obj_id, out_mask in video_segments[out_frame_idx].items():
        show_mask(out_mask, plt.gca(), obj_id=out_obj_id, borders=False)

In [None]:
#### your task ####
# segment the `dog` in the video (already extracted into JPEG frames) at '/ssdshare/share/lab5/sam2data/videos/outdoor/'
# and show the segmentation results for frames 1, 20, 50, 80, 100, and 140

In [23]:
# clean up

del predictor
del out_mask_logits
del video_segments
del inference_state
del masks
del mask
import gc
for i in range(3):
    gc.collect()
    torch.cuda.empty_cache()

## Part 3. ImageQA with Qwen2-VL
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. It can answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.

Use the pre-downloaded model weights: /ssdshare/share/lab5/Qwen2-VL-7B-Instruct, or make sure you have downloaded the weights of the model from HuggingFace.

In [58]:
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

Load the processor. The direct model load is commented out in this cell; the wrapper class defined below loads the model in half precision (bf16).


In [63]:
#model = Qwen2VLForConditionalGeneration.from_pretrained(
#    "/ssdshare/share/lab5/Qwen2-VL-7B-Instruct",  # the model weights
#    torch_dtype=torch.bfloat16,  # more about these settings later in this course
#    device_map="cuda:0", # use GPU 0
#    attn_implementation="flash_attention_2" # more about these settings later in this course
#)
processor = AutoProcessor.from_pretrained("/ssdshare/share/lab5/Qwen2-VL-7B-Instruct")

In [64]:
# Check the special tokens
# Prepare for activating the model's ability to localize objects in the image. 
#  This is useful to create the prompts in the later steps. 
#  (i.e., so you know where to put the tokens representing the image.)

processor.tokenizer.special_tokens_map

In [74]:
# Clean up
del processor
import gc
for i in range(3):
    gc.collect()
    torch.cuda.empty_cache()


Wrap the model and processor in a class so that they are easier to use.


In [78]:
class Qwen2VL:
    def __init__(self, model_name: str = "/ssdshare/share/lab5/Qwen2-VL-7B-Instruct"):
        if torch.cuda.is_available():
            print("You are running the model on GPU.")
            self.device = torch.device("cuda:0")
        else:
            print("You are running the model on CPU.")
            self.device = torch.device("cpu")
        self.dtype = torch.bfloat16
        print("Loading model and processor...")
        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name, torch_dtype=self.dtype, device_map=self.device, attn_implementation="flash_attention_2"
        )
        self.processor = AutoProcessor.from_pretrained(model_name)
        print("Model and processor loaded.")

    def generate(self, text: str, image: Image.Image, max_new_tokens: int = 128):
        # build a single-turn chat prompt: one image placeholder followed by the text
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image",},
                    {"type": "text", "text": text},
                ],
            }
        ]
        text_prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
        inputs = self.processor(
            text=[text_prompt], images=[image], padding=True, return_tensors="pt"
        )
        inputs = inputs.to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        generated_ids = [
            output_ids[len(input_ids) :]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        return output_text[0]
    
    def grounding(self, text: str, image: Image.Image):
        # text is your description of the object
        # We can't easily ask the model to localize objects through natural language (You can try it by yourself, use the generate method)
        # According to https://arxiv.org/pdf/2409.12191 and the special tokens checked above, we can use the following template
        # <|vision_start|>Picture1.jpg<|vision_end|>
        # <|object_ref_start|>the eyes on a giraffe<|object_ref_end|><|box_start|>(176,106),(232,160)<|box_end|>
        def get_pos_1000(text):
            import re
            # left, top, right, bottom
            return list(map(int, re.findall(r"\d+", text)))
        
        def drawbox_on_image(image_, pos):
            # use PIL to draw box on image
            from PIL import ImageDraw
            image = image_.copy()
            draw = ImageDraw.Draw(image)
            # map pos from 0 ~ 1000 to image size
            pos[0] = pos[0] * image.width // 1000
            pos[1] = pos[1] * image.height // 1000
            pos[2] = pos[2] * image.width // 1000
            pos[3] = pos[3] * image.height // 1000
            draw.rectangle(pos, outline="red", width=3)
            return image
        
        ## the prompt should match the special tokens above
        text_prompt = f"""<|vision_start|><|image_pad|><|vision_end|>\n<|object_ref_start|>{text}<|object_ref_end|><|box_start|>"""
        inputs = self.processor(
            text=[text_prompt], images=[image], padding=True, return_tensors="pt"
        )
        inputs = inputs.to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=64, eos_token_id=self.processor.tokenizer.encode("<|box_end|>"))
        generated_ids = [
            output_ids[len(input_ids) :]
            for input_ids, output_ids in zip(inputs.input_ids, output_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
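        # the reply should contain four integers: left, top, right, bottom (0-1000 scale)
        pos = get_pos_1000(output_text[0])
        if len(pos) < 4:
            print("Could not parse a bounding box from:", output_text[0])
            return image
        return drawbox_on_image(image, pos[:4])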
    
    def design(self, text: str, image: Image.Image):
        output_text = self.generate(text, image, max_new_tokens=768)
        def get_html(text):
            import re
            return re.findall(r"```html(.*?)```", text, re.DOTALL)
        html_blocks = get_html(output_text)
        # re.findall returns an empty list when the reply has no html code block
        if len(html_blocks) == 0:
            print("Invalid output text.")
            print(output_text)
            return None
        return html_blocks[0]
        

In [79]:
llm = Qwen2VL()
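
The `design` method depends on pulling the first html-fenced block out of the model's reply. Here is a standalone sketch of that parsing step that returns `None` when the reply has no such block. `extract_html` is a hypothetical helper, and both replies are made-up strings rather than real model output.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def extract_html(reply: str):
    """Return the body of the first html-fenced block, or None if there is none."""
    pattern = re.escape(FENCE) + r"html(.*?)" + re.escape(FENCE)
    blocks = re.findall(pattern, reply, re.DOTALL)
    return blocks[0].strip() if blocks else None

reply_ok = "Here is the page:\n" + FENCE + "html\n<h1>Hello</h1>\n" + FENCE + "\nDone."
reply_bad = "Sorry, I cannot produce HTML for this image."

print(extract_html(reply_ok))
print(extract_html(reply_bad))
```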


In [81]:
import os
imgpath = "/ssdshare/share/lab5/vqadata"

In [82]:
prompt = 'How many bottles of [Magna] beer are there? Please note that several types of beer might be on the table.'
img = Image.open(os.path.join(imgpath, "test0.jpg"))
img

In [83]:
llm.generate(prompt, img)

In [84]:
prompt =' Describe what is Object 1 and object 2. Tell me what is in the circled glass.'
img = Image.open(os.path.join(imgpath, "test1.jpg"))
img

In [85]:
llm.generate(prompt, img)

In [86]:
prompt= ' Please read the text in this image and return the information in the following JSON format (note xxx is placeholder, if the information is not available in the image, put "N/A" instead). {"class": xxx, "DLN": xxx, "DOB": xxx, "Name": xxx, "Address": xxx, "EXP": xxx, "ISS": xxx, "SEX": xxx, "HGT": xxx, "WGT": xxx, "EYES": xxx, "HAIR": xxx, "DONOR": xxx}'
img = Image.open(os.path.join(imgpath, "test2.jpg"))
img

In [87]:
llm.generate(prompt, img)
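
The model returns plain text, so a JSON-format prompt like the one above still needs a parsing step. A minimal sketch, assuming the reply contains a single JSON object possibly surrounded by extra words; `parse_json_reply` is a hypothetical helper and the reply string is made up.

```python
import json
import re

def parse_json_reply(reply: str):
    """Extract and decode the first {...} span in the reply; return None on failure."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

reply = 'Sure! {"class": "C", "DLN": "N/A", "SEX": "F"} Let me know if you need more.'
info = parse_json_reply(reply)
print(info)
print(parse_json_reply("no json here"))
```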

In [93]:
prompt ='Count the number of apples in the image.'
img = Image.open(os.path.join(imgpath, "test3.jpg"))
img

In [94]:
llm.generate(prompt, img)

In [95]:
prompt = 'Describe the landmark in the image.'
img = Image.open(os.path.join(imgpath, "test6.jpg"))
img

In [96]:
llm.generate(prompt, img)

In [98]:
prompt = 'Describe the name of the dish.'
img = Image.open(os.path.join(imgpath, "test7.jpeg"))
img

In [99]:
llm.generate(prompt, img)

In [100]:
prompt = 'What is wrong with the foot in this figure??'
img = Image.open(os.path.join(imgpath, "test8.jpg"))
img

In [101]:
llm.generate(prompt, img)

In [102]:
prompt ='What is the spatial relation between the frisbee and the man?'
img = Image.open(os.path.join(imgpath, "test9.jpg"))
img

In [103]:
llm.generate(prompt, img)

In [104]:
prompt = 'Which oceans surround Africa?  both to the east and to the west.'
img = Image.open(os.path.join(imgpath, "test13.jpg"))
img

In [105]:
llm.generate(prompt, img)

In [106]:
grounding_img = Image.open(os.path.join(imgpath, "grounding.jpg"))
grounding_img = grounding_img.resize((grounding_img.width // 2, grounding_img.height // 2))
llm.grounding("A squirrel", grounding_img)

In [50]:
llm.grounding("Tail of the squirrel", grounding_img)
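
The `grounding` method reads four integers on a 0 to 1000 scale from the reply and rescales them to pixel coordinates. The sketch below isolates that arithmetic. The reply string reuses the example box from the template comment in the class, while the 640×480 image size and the helper names are assumptions made for illustration.

```python
import re

def parse_box_1000(reply: str):
    """Return [left, top, right, bottom] on the 0-1000 scale, or None if missing."""
    nums = list(map(int, re.findall(r"\d+", reply)))
    return nums[:4] if len(nums) >= 4 else None

def to_pixels(box_1000, width: int, height: int):
    """Map a 0-1000 box to integer pixel coordinates for the given image size."""
    left, top, right, bottom = box_1000
    return [left * width // 1000, top * height // 1000,
            right * width // 1000, bottom * height // 1000]

reply = "(176,106),(232,160)"
box = parse_box_1000(reply)
pixel_box = to_pixels(box, width=640, height=480)
print(pixel_box)
```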

In [77]:
# clean up

del llm

import gc
for i in range(3):
    gc.collect()
    torch.cuda.empty_cache()

In [52]:
### Your Task
# 1. Run a different VLM and ask it two questions.
# 2. Referring to lab5/3-qwen2vl_api.ipynb, use different VLMs to generate text.


## Part 4. Text-to-Image with Stable Diffusion

### 4.1 ComfyUI

ComfyUI is a powerful, modular GUI for diffusion models that lets users build and run image-generation workflows.


#### Step 1. Running ComfyUI on the cluster
We have built a Docker image that contains all the necessary dependencies to run ComfyUI. To run ComfyUI, follow the steps below:

1. Create a new configuration file by copying https://github.com/iiisthu/ailab/blob/main/user/comfyui-template.yaml . Set your namespace in the configuration file, and use harbor-local.ai.iiis.co/llm-course/comfyui:v1 as the Docker image (already set in the template).

2. Run `helm install`, just as in the other labs.

3. To access the web UI that you started in the cluster, run the following command to forward the port to your local computer (you can also do this with the Kubernetes plugin in VSCode).
```bash
# on your local PC
kubectl port-forward pod/<pod_name> 8188:8188
```

4. Open a browser and visit http://127.0.0.1:8188 ; you should see the ComfyUI interface.
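
If the page does not load, it can help to check whether anything is listening on the forwarded port before debugging further. A minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and the port is the one forwarded in step 3.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True only while `kubectl port-forward` is running and the pod is serving
print(is_port_open("127.0.0.1", 8188))
```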


#### Step 2. Playing with ComfyUI
![ComfyUI Interface](assets/preview.png)
1. Import a workflow configuration file. Click the `Workflow` button and select a workflow configuration file stored on **your PC**. We provide some example configs in `LLM-applications-course/lec5/comfy_example_workflows`. You can download them to your PC and import them.
![Example Workflow Configuration File](assets/select.png)

2. Select models. You can zoom in to see the workflow more clearly. In this workflow, the `Load Checkpoint` module requires `ckpt_name` to be one of sdv3/2b_1024/sd3_medium.safetensors. For example, we select sd3_medium.safetensors.

3. Write your text prompt in the `CLIP Text Encode` module. `Prompt` is what we want; `Negative Prompt` is what we don't want.

4. (Optional) Change the `Seed` setting to `randomize` to generate different results.

5. (Optional) Customize your workflow, e.g., add a LoRA module.

6. Click `Queue` to run the workflow.


In [None]:
#### your task ####
# write down your prompts here
# and copy the generated image below
# if you have a different workflow from the above (either you write your own, or you find on the Internet)
# please submit it together with this notebook

### 4.2 Training Stable Diffusion with WebUI

stable-diffusion-webui is a web interface for Stable Diffusion. Here we provide a tutorial on how to train Stable Diffusion using the web interface.

#### Step 1. Running the Stable Diffusion WebUI on the cluster
We have built a Docker image that contains all the necessary dependencies to run sd-WebUI.

Follow a procedure similar to the one for ComfyUI to start the WebUI, using webui-template.yaml.
This time, forward port 7860 instead.

You should then be able to access the GUI at http://127.0.0.1:7860 from your PC.

#### Step 2. Playing with SD-WebUI
Pay attention to all paths in pictures below. If a path starts with "/share", you should replace it with "/ssdshare/share", e.g., "/share/lab5/clip-vit-l-14" -> "/ssdshare/share/lab5/clip-vit-l-14".

We want to teach the model a `concept` of `headless_statue`.
![Headless Statue](assets/sd-hdst.jpeg)

1. Select an SD model from the dropdown list.
![Select Model](assets/sd-ckpt.png)

2. Try to generate images using prompt: "An oil painting of headless_statue". We are not satisfied with the generated images.
![Generate Images](assets/sd-base.png)

3. Preprocess our `headless_statue` images. Set the parameters as shown in the image below, and click the `Generate` button.
![Preprocess Images](assets/sd-preprocess.png)

4. Create an embedding for our concept `headless_statue`.

- Name: the filename for the created embedding. You will also use this text in prompts when referring to the embedding.

- Initialization text: the embedding you create will initially be filled with the vectors of this text. If you create a one-vector embedding named "zzzz1234" with "tree" as the initialization text and use it in a prompt without training, then the prompt "a zzzz1234 by monet" will produce the same pictures as "a tree by monet".

- Number of vectors per token: the size of the embedding. The larger this value, the more information about the subject you can fit into the embedding, but also the more tokens it takes away from your prompt allowance. With Stable Diffusion, you have a limit of 75 tokens in the prompt. If you use an embedding with 16 vectors in a prompt, that leaves you with space for 75 - 16 = 59 tokens. Also, from experience, the larger the number of vectors, the more pictures you need to obtain good results.

![Create Embedding](assets/sd-create-emb.png)
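
The prompt-budget arithmetic above can be written as a small helper. This is only a sketch of the subtraction described in the text: the 75-token limit is the figure quoted above, and `remaining_prompt_tokens` is a hypothetical name.

```python
PROMPT_TOKEN_LIMIT = 75  # limit quoted in the text above

def remaining_prompt_tokens(vectors_per_token: int, n_uses: int = 1) -> int:
    """Tokens left for the rest of the prompt after using the embedding n_uses times."""
    used = vectors_per_token * n_uses
    if used > PROMPT_TOKEN_LIMIT:
        raise ValueError("the embedding alone exceeds the prompt limit")
    return PROMPT_TOKEN_LIMIT - used

for n in (1, 4, 16):
    print(n, remaining_prompt_tokens(n))
```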

5. Train the embedding. Set the parameters as shown in the images below, and click the `Train Embedding` button.
![Train Embedding](assets/sd-train.png)
![Sample](assets/sd-sample.png)

6. Generate images using prompt: "An oil painting of `Name`". We can see the generated images are more related to our concept `headless_statue`.
![Result](assets/sd-result.png)


### Textual Inversion
Textual Inversion is a parameter-efficient method for fine-tuning Stable Diffusion. It can represent a wide array of concepts: with it, Stable Diffusion can learn a pseudo-word that stands for a specific artist or a new concept.
![Textual Inversion](assets/teaser.JPG)

#### Why do we need Textual Inversion?
Suppose we have an SD model trained on a dataset that contains no images of `The Thinker`. If we want it to generate `A cat in the pose of The Thinker`, we need to rewrite the prompt as `A cat with its hand on its chin, sitting on a rock, its eyes looking down thoughtfully`, because the model does not know what the pose of `The Thinker` looks like. Textual Inversion can find a pseudo-word that represents the concept of `The Thinker` from just 3-5 images of it.

#### How does it work?
The essence of Textual Inversion is to map the object in the images to a pseudo-word. The pseudo-word is actually a high-dimensional vector, not necessarily a natural-language word; we merely use a natural-language word as a tag for it.
![Principal](assets/training.JPG)
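
The key point is that only the pseudo-word vector is optimized while everything else stays frozen. The toy sketch below illustrates just that idea with a least-squares problem: `frozen_model` is a fixed random matrix standing in for the frozen network and `target` stands in for the concept images. It is not the diffusion training loss, and all names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

frozen_model = rng.normal(size=(16, dim))     # stand-in for the frozen network (never updated)
target = frozen_model @ rng.normal(size=dim)  # stand-in for what the concept images require

pseudo_word = np.zeros(dim)  # the only trainable parameters
lr = 0.01
losses = []
for step in range(500):
    residual = frozen_model @ pseudo_word - target
    losses.append(float(residual @ residual))
    # gradient with respect to pseudo_word only; frozen_model gets no update
    grad = 2.0 * frozen_model.T @ residual
    pseudo_word -= lr * grad

print(losses[0], losses[-1])
```

After training, `pseudo_word` is a `dim`-sized vector that can be plugged in wherever a word embedding is expected, which is what makes the method so cheap in parameters.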

#### Why not train the embedding directly?
Training the full embedding table directly would mean optimizing $\frac{\text{vocab\_size}}{\text{token\_vectors\_num}}$ times as many parameters as the pseudo-word embedding in Textual Inversion, which requires far more data (here vocab_size is the size of the vocabulary used in Stable Diffusion, and token_vectors_num is the number of token vectors we train in Textual Inversion).
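
To make the ratio concrete, here is a quick calculation. The numbers are assumptions for illustration: a vocabulary of 49408 tokens and an embedding width of 768 (the values used by the CLIP text encoder in Stable Diffusion v1.x), and 4 vectors per token for the pseudo-word.

```python
vocab_size = 49408      # assumed: CLIP tokenizer vocabulary size
embedding_dim = 768     # assumed: text-embedding width in SD v1.x
token_vectors_num = 4   # assumed: vectors per token chosen for the pseudo-word

full_table_params = vocab_size * embedding_dim
pseudo_word_params = token_vectors_num * embedding_dim
ratio = full_table_params // pseudo_word_params

print(full_table_params, pseudo_word_params, ratio)
```

The embedding width cancels out, so the ratio depends only on the vocabulary size and the number of trained vectors.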