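## Creating a LLaVA model

1. Set the token id for `<image>` so that it encodes to a single id instead of several.
2. Set `pad_token_id`.
3. Extract the `vision_model` module from the CLIP model.
4. Extract the `language_model` module.
5. Copy the related files.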

### Download the models

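```bash
pip install -U huggingface_hub

# Optional: use a mirror endpoint if huggingface.co is slow or unreachable
export HF_ENDPOINT=https://hf-mirror.com

# CLIP vision encoder
huggingface-cli download --resume-download openai/clip-vit-large-patch14-336 --local-dir openai/clip-vit-large-patch14-336 --local-dir-use-symlinks False

# Language model. Note: the notebook cells below load ./Qwen/Qwen2.5-3B-Instruct,
# so adjust the repo id and --local-dir to the model you actually use.
huggingface-cli download --resume-download Qwen/Qwen1.5-4B-Chat --local-dir Qwen1.5-4B-Chat --local-dir-use-symlinks False
```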
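
The same downloads can also be scripted from Python. The sketch below is a rough equivalent of the shell commands above, assuming `huggingface_hub` is installed. The helper names `download_plan` and `download_all` are made up for illustration, and nothing is downloaded until `download_all()` is called explicitly.

```python
import os

# Mirror endpoint taken from the shell snippet above.
HF_MIRROR = "https://hf-mirror.com"


def download_plan():
    """Return (repo_id, local_dir) pairs matching the CLI commands above."""
    return [
        ("openai/clip-vit-large-patch14-336", "openai/clip-vit-large-patch14-336"),
        ("Qwen/Qwen1.5-4B-Chat", "Qwen1.5-4B-Chat"),
    ]


def download_all(plan=None):
    """Download every repo in the plan (hypothetical helper; needs network access)."""
    os.environ.setdefault("HF_ENDPOINT", HF_MIRROR)
    # Imported lazily so the endpoint is set before the library is first imported.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan or download_plan():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)


for repo_id, local_dir in download_plan():
    print(f"{repo_id} -> {local_dir}")
```

Calling `download_all()` performs the actual downloads. If `huggingface_hub` was already imported in the session, the endpoint setting may not take effect, so it is safest to run this in a fresh process.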



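### Modify the Qwen tokenizer files: set the token id for `<image>`

1. In `tokenizer_config.json`, add the following entry to `added_tokens_decoder`. The key must be an id the tokenizer does not already use: `151646` is the value this step was written with, while the Qwen2.5-3B-Instruct tokenizer loaded later in this notebook reports `151665` for `<image>`.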

```json
"151646": {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
```

2. In the same `tokenizer_config.json`, add `"<image>"` to `additional_special_tokens`.
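
The two manual edits above can also be applied programmatically. The following is a minimal sketch using only the standard library. The function name `add_image_token` and the tiny `sample` dict are illustrative assumptions, not part of the original workflow; the real file contains many more entries.

```python
import json


def add_image_token(config, token_id, token="<image>"):
    """Register `token` in a tokenizer_config.json-style dict and return it."""
    decoder = config.setdefault("added_tokens_decoder", {})
    key = str(token_id)
    # Refuse to overwrite an id that already belongs to a different token.
    if key in decoder and decoder[key].get("content") != token:
        raise ValueError(f"id {token_id} is already used by {decoder[key]['content']!r}")
    decoder[key] = {
        "content": token,
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
        "special": True,
    }
    specials = config.setdefault("additional_special_tokens", [])
    if token not in specials:
        specials.append(token)
    return config


# Tiny in-memory stand-in for the real file (contents are illustrative only).
sample = {
    "added_tokens_decoder": {"151645": {"content": "<|im_end|>", "special": True}},
    "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
}
updated = add_image_token(sample, 151646)
print(json.dumps(updated["added_tokens_decoder"]["151646"], indent=2))
```

To apply it to the real file, load `tokenizer_config.json` with `json.load`, pass the dict through `add_image_token`, and write it back with `json.dump`. Keeping a backup of the original file first is a sensible precaution.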



### Verify


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
print(torch.__version__)
modify_qwen_tokenizer_dir = "./Qwen/Qwen2.5-3B-Instruct"
modify_llama_tokenizer_dir = "./llava-1.5-7b-hf"

modify_qwen_tokenizer = AutoTokenizer.from_pretrained(modify_qwen_tokenizer_dir)
modify_llama_tokenizer = AutoTokenizer.from_pretrained(modify_llama_tokenizer_dir)

modify_qwen_tokenizer.encode("<image>")
# modify_llama_tokenizer.encode("<image>")

2.1.2
151645
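
The point of this check is that `<image>` must map to exactly one id. Below is a small sketch of that assertion. `check_single_token` is a hypothetical helper, and the two stand-in encoders (with made-up ids) only exist so the snippet runs without loading a tokenizer.

```python
def check_single_token(encode, token="<image>"):
    """Return the id of `token`, or raise if it does not encode to exactly one id."""
    ids = list(encode(token))
    if len(ids) != 1:
        raise ValueError(f"{token!r} encodes to {len(ids)} ids: {ids}")
    return ids[0]


# Stand-in encoders for illustration only.
def registered_encode(text):
    # Behaves like a tokenizer in which <image> is a registered special token.
    return [151646] if text == "<image>" else [0]


def unregistered_encode(text):
    # Without the special token, "<image>" is split into several pieces (made-up ids).
    return [27, 1805, 29]


print(check_single_token(registered_encode))
```

In the notebook, `check_single_token(modify_qwen_tokenizer.encode)` would perform the same check on the real tokenizer. Tokenizers that prepend a BOS token may need `add_special_tokens=False` passed to `encode` for the count to be meaningful.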


In [None]:
print(len(modify_qwen_tokenizer))

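Q: After adding this new token, does the model's embedding module need to be modified?
A: No. `qwen_model.model.embed_tokens` already has more rows than the tokenizer has tokens, so the new id fits without resizing.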
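
The reasoning behind that answer is a simple bounds check: a token id is a row index into the embedding matrix, so it only needs to be smaller than the number of rows. Here is a sketch with hypothetical helper names, using the 151936 rows of `embed_tokens` reported in this notebook and the id `151646` from the JSON snippet above. The tokenizer length of `151647` is an assumption (largest id plus one).

```python
def needs_resize(max_token_id, embedding_rows):
    """True when some token id would index past the end of the embedding matrix."""
    return max_token_id >= embedding_rows


def spare_embedding_rows(tokenizer_len, embedding_rows):
    """Embedding rows not claimed by any token (negative if there are too few rows)."""
    return embedding_rows - tokenizer_len


print(needs_resize(151646, 151936))
print(spare_embedding_rows(151647, 151936))
```

If `needs_resize` ever returned `True`, the usual remedy in `transformers` would be `model.resize_token_embeddings(len(tokenizer))`.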

In [None]:
qwen_model = AutoModelForCausalLM.from_pretrained(modify_qwen_tokenizer_dir, device_map='cuda:0', 
                                                  torch_dtype=torch.bfloat16)

In [None]:
# qwen_model's embed_tokens is a 2-D matrix with 151936 rows, which already leaves room for extra special tokens
qwen_model.model.embed_tokens

In [None]:
qwen_model.lm_head
qwen_model

## Restart the kernel and begin initialization
## Initializing the LLaVA model

In [2]:
clip_model_name_or_path = (
    "./openai/clip-vit-large-patch14-336"
)
qwen_model_name_or_path = "./Qwen/Qwen2.5-3B-Instruct"

In [3]:
from transformers import (  AutoModel, 
                            AutoModelForCausalLM, 
                            AutoTokenizer, 
                            AutoProcessor,
                        )

clip_model = AutoModel.from_pretrained(clip_model_name_or_path, device_map="cuda:0")
llm_model = AutoModelForCausalLM.from_pretrained(
    qwen_model_name_or_path, device_map="cuda:0"
)


In [3]:
llm_model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-35): 36 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=256, bias=True)
          (v_proj): Linear(in_features=2048, out_features=256, bias=True)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((2048,), eps=1e-06)
    (rotary_emb):

In [3]:
clip_model

CLIPModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 768)
      (position_embedding): Embedding(77, 768)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0-11): 12 x CLIPEncoderLayer(
          (self_attn): CLIPSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e

In [4]:
llm_tokenizer = AutoTokenizer.from_pretrained(qwen_model_name_or_path)
llm_tokenizer.encode("<image>")

[151665]

In [6]:
from transformers import (
    LlavaForConditionalGeneration,
    LlavaConfig
)

## Take the configs of the CLIP model and llm_model and initialize a LLaVA model

In [6]:
import torch
# Specify the device
device = "cuda:0"

# Reuse the config of the CLIP vision tower
vision_config = clip_model.vision_model.config

# Reuse the config of the Qwen language model
text_config = llm_model.config

# Build a LLaVA configuration from the two sub-configs
configuration = LlavaConfig(vision_config, text_config)

# Instantiate a LLaVA model from that configuration (weights are randomly initialized)
model = LlavaForConditionalGeneration(configuration)

In [None]:
model

In [7]:
model.vision_tower.vision_model.embeddings

CLIPVisionEmbeddings(
  (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
  (position_embedding): Embedding(577, 1024)
)

In [8]:
clip_model.vision_model.embeddings

CLIPVisionEmbeddings(
  (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
  (position_embedding): Embedding(577, 1024)
)
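
The `Embedding(577, 1024)` shape follows from the encoder's geometry: a 336-pixel image cut into 14-pixel patches gives 24 patches per side, and one extra position is reserved for the class token. This can be checked with a short calculation (`clip_num_positions` is an illustrative name):

```python
def clip_num_positions(image_size, patch_size):
    """Patches per image plus one class-token position for a ViT-style encoder."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + 1


print(clip_num_positions(336, 14))  # 24 * 24 + 1 = 577
```

If the class token is dropped before projection, as LLaVA-style models commonly do, 576 visual features remain per image.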

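#### The steps above only build the LLaVA architecture; its weights are still randomly initialized, so the weights of the two pretrained models must be copied in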

In [9]:
model.vision_tower.vision_model = clip_model.vision_model
model.language_model = llm_model

In [10]:
llm_model.model.embed_tokens.weight.data[:, :2]

tensor([[ 0.0391,  0.0142],
        [ 0.0112,  0.0142],
        [-0.0271, -0.0248],
        ...,
        [-0.0130,  0.0016],
        [-0.0130,  0.0016],
        [-0.0130,  0.0016]], device='cuda:0')

In [11]:
model.language_model.model.embed_tokens.weight.data[:, :2]

tensor([[ 0.0391,  0.0142],
        [ 0.0112,  0.0142],
        [-0.0271, -0.0248],
        ...,
        [-0.0130,  0.0016],
        [-0.0130,  0.0016],
        [-0.0130,  0.0016]], device='cuda:0')

Copy the pad_token_id

In [None]:
model.config.pad_token_id
llm_tokenizer.pad_token_id
llm_tokenizer

### Assign the Qwen pad token id

In [12]:
model.config.pad_token_id = llm_tokenizer.pad_token_id
model.config.pad_token_id

151643

### Assign image_token_index

In [13]:
model.config.image_token_index

32000

In [7]:
llm_tokenizer.encode("<image>")[0]

151665

In [15]:
model.config.image_token_index = llm_tokenizer.encode("<image>")[0]
model.config.image_token_index

151665

Save the model

In [None]:
model.save_pretrained("qwen2.5_3B_Instruct_clipvL14_model/v4.48.0/model001")

Save the processor

In [5]:
for i in range(2000, 52000, 2000):
    print(i)
    llm_tokenizer.save_pretrained(f"./result_model/stage1/[v3.CC3M-Pretrain-595K]qwen2.5_3B_Instruct_clipvL14/checkpoint-{i}")

2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
22000
24000
26000
28000
30000
32000
34000
36000
38000
40000
42000
44000
46000
48000
50000


In [9]:
autoprocessor = AutoProcessor.from_pretrained(clip_model_name_or_path)
for i in range(2000, 52000, 2000):
    autoprocessor.save_pretrained(f"./result_model/stage1/[v3.CC3M-Pretrain-595K]qwen2.5_3B_Instruct_clipvL14/checkpoint-{i}")
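
The two loops above hard-code the checkpoint steps. As an alternative, the checkpoint directories that actually exist can be discovered on disk. This is a sketch with a hypothetical helper, `find_checkpoints`, demonstrated on a temporary directory rather than the real training output.

```python
from pathlib import Path
import tempfile


def find_checkpoints(root):
    """Return checkpoint-<step> subdirectories of `root`, sorted by step."""
    found = []
    for path in Path(root).glob("checkpoint-*"):
        step = path.name.split("-", 1)[1]
        if path.is_dir() and step.isdigit():
            found.append((int(step), path))
    return [path for _, path in sorted(found)]


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for step in (4000, 2000, 10000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    names = [p.name for p in find_checkpoints(tmp)]
    print(names)
```

In the notebook, each returned path could then be passed to `llm_tokenizer.save_pretrained(...)` and `autoprocessor.save_pretrained(...)`. Because only the `checkpoint-*` argument is treated as a pattern, the square brackets in the run directory name should not interfere.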

Note:
1. The `preprocessor_config.json` file from `show_model/model002` must be copied into `show_model/model001`.
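
That copy step can be scripted as well. The sketch below uses a hypothetical helper, `copy_preprocessor_config`, and runs against temporary stand-in directories with a dummy file, so the JSON content shown is not the real image-processor config.

```python
import shutil
import tempfile
from pathlib import Path


def copy_preprocessor_config(src_dir, dst_dir, name="preprocessor_config.json"):
    """Copy the image-processor config from src_dir into dst_dir; return the new path."""
    src = Path(src_dir) / name
    if not src.is_file():
        raise FileNotFoundError(f"{src} does not exist")
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dst / name))


# Demonstration with temporary stand-ins for the two model directories.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "model002"
    src_dir.mkdir()
    (src_dir / "preprocessor_config.json").write_text('{"crop_size": 336}')  # dummy content
    copied = copy_preprocessor_config(src_dir, Path(tmp) / "model001")
    copied_name = copied.name
    copied_text = copied.read_text()
    print(copied_text)
```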

# Restart the kernel and test the model

In [None]:
from transformers import LlavaProcessor, LlavaForConditionalGeneration
import torch


model_name_or_path = "qwen2.5_3B_Instruct_clipvL14_model/v4.48.0/model001"
# model_name_or_path = "test_model_copy/model001"  #

llava_processor = LlavaProcessor.from_pretrained(model_name_or_path)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name_or_path, device_map="cuda:0", torch_dtype=torch.bfloat16
)

In [None]:
llava_processor, llava_processor.__class__

In [None]:
from PIL import Image

prompt_text = "<image>\nWhat are these?"


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_text},
]
prompt = llava_processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)


image_path = "./data/000000039769.jpg"
image = Image.open(image_path)

inputs = llava_processor(text=prompt, images=image, return_tensors="pt")

for tk in inputs.keys():
    inputs[tk] = inputs[tk].to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=200)
gen_text = llava_processor.batch_decode(
    generate_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]

print(gen_text)
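
To see what `apply_chat_template` hands to the processor, it helps to write out the prompt layout by hand. The sketch below approximates the ChatML format used by Qwen chat models. It is an assumption for illustration only: the authoritative template is the `chat_template` stored with the tokenizer, and `build_chatml_prompt` is a made-up name.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Approximate a ChatML prompt: one <|im_start|>role ... <|im_end|> block per message."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<image>\nWhat are these?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

The single `<image>` placeholder in the user turn is the position where the processor and model substitute the visual features, which is why it has to be a single token id.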

In [None]:
llava_processor

In [None]:
image

In [None]:
inputs

In [None]:
model.config