# I. Kimi-VL-A3B-Instruct Inference Example

## (1) Import the Required Libraries

In [4]:
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

## (2) Load the Model

In [8]:
# model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model_path = "/root/autodl-tmp/moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # this parameter must be enabled
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

## (3) Prepare the Input Data

In [9]:
image_path = "/root/Kimi-VL/figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]
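
The message list above follows a fixed shape: one user turn whose content holds an image entry and a text entry. A minimal sketch of a helper that builds this structure is given here; `build_messages` is a hypothetical name introduced for illustration and is not part of the Kimi-VL repository or of `transformers`.

```python
def build_messages(image_path, question):
    """Build a single-turn user message with one image and one text prompt.

    Hypothetical helper: it only mirrors the dictionary layout used in the
    cell above and does not load the image or call the model.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_messages(
    "/root/Kimi-VL/figures/demo.png",
    "What is the dome building in the picture? Think step by step.",
)
```

Only the question string changes between the English and Chinese prompts in this notebook, so a helper like this keeps the rest of the structure in one place.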

## (4) Generate and Decode the Response

In [10]:
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)

To identify the dome building in the picture, we can follow these steps:

1. **Observation of the Building**: The dome building is centrally located in the image and is characterized by its large, white, dome-shaped roof. This distinctive architectural feature makes it stand out among the surrounding structures.

2. **Contextual Clues**: The building is situated in an urban environment with high-rise buildings and a busy highway, indicating it is likely a significant landmark within a city.

3. **Identification of the Structure**: The dome building is identified as the Rogers Centre, a well-known multi-purpose stadium in Toronto, Canada. The Rogers Centre is famous for its retractable roof and is a prominent feature of the city's skyline.

4. **Verification with Known Information**: Cross-referencing with known images and information about the Rogers Centre confirms that the building in the picture matches the description and appearance of the Rogers Centre.

Therefore, the dome buildi
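
The `generated_ids_trimmed` step in the cell above is there because, for decoder-only models, `generate` returns each prompt followed by the newly generated tokens. Slicing off `len(in_ids)` leaves only the reply. The sketch below shows the same slicing on plain Python lists; the token IDs are made up for illustration and stand in for the real tensors.

```python
# Toy token IDs standing in for real tensors (values are made up).
input_ids = [[101, 7, 8, 9]]                 # one prompt of 4 tokens
generated_ids = [[101, 7, 8, 9, 42, 43, 2]]  # prompt followed by 3 new tokens

# Same list comprehension as in the notebook cell: drop the prompt prefix.
trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43, 2]]
```

Without this step, `batch_decode` would decode the prompt text as well as the model's answer.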

## (5) Asking a Question in Chinese

In [11]:
image_path = "/root/Kimi-VL/figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "图片中有什么内容？"}]}
]

In [12]:
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)

这张图片展示了一个城市的天际线，背景是黄昏时分的天空。图片中可以看到：

1. **高楼大厦**：左侧和右侧都有高层建筑，其中左侧的建筑有玻璃幕墙，右侧的建筑有独特的圆顶结构。
2. **交通**：前景中有一条繁忙的公路，车辆的车灯在道路上形成了一条光带，显示出交通的繁忙。
3. **地标建筑**：右侧有一个高耸的塔楼，可能是城市的标志性建筑之一。
4. **天空**：天空中有一些云彩，呈现出黄昏时分的柔和色彩。

整体画面展示了城市的繁华和现代感。
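
The English and Chinese runs repeat the same four steps: build the messages, apply the chat template, generate, then trim and decode. A minimal sketch of a wrapper that bundles them follows. `ask` is a hypothetical function name; it assumes `model` and `processor` were loaded as in section (2) and that `image` is an already opened `PIL.Image`.

```python
def ask(model, processor, image, image_path, question, max_new_tokens=512):
    """Run one image + text turn and return the decoded reply.

    Hypothetical wrapper around the calls used in this notebook; it adds no
    new behavior beyond grouping them into one function.
    """
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    inputs = processor(
        images=image, text=text, return_tensors="pt", padding=True, truncation=True
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

With the objects from the earlier cells, a call would look like `print(ask(model, processor, image, image_path, "图片中有什么内容？"))`.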
