# vLLM Adaptation Notes

## pytorch
```
pip install torch==2.4.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu121
# The numpy version must be pinned; otherwise an error occurs
pip install numpy==1.24.4
```
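- During installation, `nvidia-cuda-runtime-cu12==12.1.105` was reinstalled. A prebuilt CUDA image may therefore be unnecessary, since installing torch pulls in the CUDA runtime automatically.
- `--index-url https://download.pytorch.org/whl/test/cu121` may also be unnecessary, with a plain install of torch and its dependencies being enough. This still needs to be verified.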
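
To help verify the two open questions above, it is useful to record exactly which versions pip ended up installing. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this note, and the package list is simply the set of distributions mentioned in this section.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distributions named in this section; extend the list as needed.
    names = ["torch", "torchvision", "torchaudio", "numpy", "nvidia-cuda-runtime-cu12"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in a plain image and once in a CUDA image gives a direct comparison of what the torch install brings in by itself.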

In [1]:
# Check the CUDA environment
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

True
4
0
NVIDIA GeForce RTX 3090
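
The check above only prints the name of device 0, although four devices are reported. A small sketch that lists every visible device with its memory might look like this; `list_gpus` is a hypothetical helper, and it returns an empty list when torch or CUDA is unavailable so that it can run anywhere.

```python
def list_gpus():
    """Return one (index, name, memory in GiB) tuple per visible CUDA device."""
    try:
        import torch
    except ImportError:
        return []  # torch is not installed in this environment
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append((index, props.name, round(props.total_memory / 1024**3, 1)))
    return gpus


if __name__ == "__main__":
    for index, name, mem_gib in list_gpus():
        print(f"cuda:{index}  {name}  {mem_gib} GiB")
```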


## vllm 0.5.4
```
# Install the matching vllm version
wget https://github.com/vllm-project/vllm/releases/download/v0.5.4/vllm-0.5.4-cp310-cp310-manylinux1_x86_64.whl
pip install vllm-0.5.4-cp310-cp310-manylinux1_x86_64.whl

# Install the Python libraries needed for inference
pip install bitsandbytes
pip install peft==0.4.0

# Install a C/C++ compiler
apt-get install build-essential

# Install the Python development headers (provides Python.h)
apt install python3.10-dev
```
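
The two apt packages above line up with the "Failed to find C compiler" and "Python.h: No such file or directory" errors named in the reference list. As a rough pre-flight check, the sketch below looks for a C compiler on `PATH` and for `Python.h` in the interpreter's include directory. The function name `check_build_prereqs` is an assumption of this note, and probing only `cc` and `gcc` is a simplification.

```python
import os
import shutil
import sysconfig


def check_build_prereqs():
    """Return the paths of a C compiler and Python.h, or None where missing."""
    # Prefer an explicitly configured compiler, then fall back to common names.
    compiler = None
    for candidate in (os.environ.get("CC"), "cc", "gcc"):
        if candidate and shutil.which(candidate):
            compiler = shutil.which(candidate)
            break

    # Python.h lives in the interpreter's include directory when the
    # development headers (e.g. python3.10-dev) are installed.
    header = os.path.join(sysconfig.get_paths()["include"], "Python.h")
    return {
        "c_compiler": compiler,
        "python_header": header if os.path.isfile(header) else None,
    }


if __name__ == "__main__":
    for item, path in check_build_prereqs().items():
        print(f"{item}: {path or 'MISSING'}")
```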
- Create symbolic links
```
ln -s /opt/data/lora/ .
ln -s /opt/data/cache/ .
```
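
The model and adapter paths used later are relative (`cache/...` and `lora/...`), so they only resolve if these links exist in the working directory. A minimal sketch for confirming that, with `check_links` as a hypothetical helper name:

```python
from pathlib import Path


def check_links(names, base="."):
    """Map each name to its resolved target, or None if it is not a working symlink."""
    result = {}
    for name in names:
        link = Path(base) / name
        # exists() follows the link, so a dangling symlink is reported as None.
        if link.is_symlink() and link.exists():
            result[name] = str(link.resolve())
        else:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, target in check_links(["lora", "cache"]).items():
        print(f"{name} -> {target or 'missing or broken'}")
```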

### Test loading an existing model

In [None]:
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig
)
from peft import PeftModel

model_path = "cache/models--baichuan-inc--Baichuan2-7B-Chat/snapshots/ea66ced17780ca3db39bc9f8aa601d8463db3da5"
lora_path = "lora/baichuan7B-data2text-continue"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(
    model,
    lora_path,
)

model.eval()

  @torch.library.impl_abstract("xformers_flash::flash_fwd")
  @torch.library.impl_abstract("xformers_flash::flash_bwd")


### Test simple inference

In [None]:
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

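### Test LLM inference
- It is recommended to **restart the kernel** before running this, which releases the GPU memory still held by the model loaded above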

In [None]:
import os
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig
)

model_path = "./cache/models--baichuan-inc--Baichuan2-7B-Chat/snapshots/ea66ced17780ca3db39bc9f8aa601d8463db3da5"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(model_path, enable_lora=True, trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"]
)

prompts = [
    "[user] 今天天气怎么样 [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params
)

print(outputs)
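
`print(outputs)` dumps the full request objects, and the cell enables LoRA without ever passing an adapter. The sketch below shows one way to address both: `extract_texts` pulls out the prompt and the generated text, and `generate_with_lora` passes a `LoRARequest` to `llm.generate`. Both function names are made up for this note, and the adapter name `"data2text"` and id `1` are arbitrary labels. The second function needs vllm and a GPU, and whether the adapter from `lora/baichuan7B-data2text-continue` loads correctly in vLLM has not been checked here.

```python
from types import SimpleNamespace


def extract_texts(outputs):
    """Return (prompt, generated text) pairs from vLLM RequestOutput-like objects."""
    return [(out.prompt, out.outputs[0].text) for out in outputs]


def generate_with_lora(llm, prompts, sampling_params, lora_path):
    """Run generation through a LoRA adapter (requires vllm and a GPU)."""
    # Imported lazily so the rest of this sketch runs without vllm installed.
    from vllm.lora.request import LoRARequest

    # Arguments: adapter name, integer adapter id, local adapter path.
    lora_request = LoRARequest("data2text", 1, lora_path)
    return llm.generate(prompts, sampling_params, lora_request=lora_request)


if __name__ == "__main__":
    # Stand-in objects with the same attribute layout, so this runs without a GPU.
    fake_outputs = [
        SimpleNamespace(prompt="example prompt", outputs=[SimpleNamespace(text=" example text")])
    ]
    for prompt, text in extract_texts(fake_outputs):
        print(repr(prompt), "->", repr(text))
```

With the objects from the cell above, usage would be along the lines of `extract_texts(generate_with_lora(llm, prompts, sampling_params, "lora/baichuan7B-data2text-continue"))`.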

## References
- [torch 2.4.0+cu121 quickstart](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
- https://pytorch.org/get-started/locally/
- [Fixing "Python.h: No such file or directory"](https://blog.csdn.net/dqchouyang/article/details/119571456)
- [What to do when every line of apt-get update shows "Ign"](https://www.cnblogs.com/ldy233/p/13216860.html)
- [Failed to find C compiler. Please specify via CC environment variable](https://github.com/vllm-project/vllm/issues/2997)
- [numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject](https://stackoverflow.com/questions/78634235/numpy-dtype-size-changed-may-indicate-binary-incompatibility-expected-96-from)
- [Resolving CUDA version mismatches when deploying vllm](https://juejin.cn/post/7386493960938176521)
- https://github.com/vllm-project/vllm/releases
- https://docs.vllm.ai/en/latest/dev/offline_inference/llm.html
- https://docs.vllm.ai/en/latest/models/lora.html
- [Adding hot-loading of LoRA adapters to vllm](https://www.cnblogs.com/alphainf/p/18227171)

### Tips
- Skip articles from CSDN and the Baidu developer center; they only add frustration and waste time