# Fine-tuning with LLaMA Factory on SageMaker
# 2. Local inference with vLLM

## Install dependencies

In [1]:
!pip install vllm==0.4.3 bitsandbytes

Collecting vllm==0.4.3
  Downloading vllm-0.4.3-cp310-cp310-manylinux1_x86_64.whl.metadata (7.8 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Collecting cmake>=3.21 (from vllm==0.4.3)
  Downloading cmake-3.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.1 kB)
Collecting ninja (from vllm==0.4.3)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting sentencepiece (from vllm==0.4.3)
  Downloading sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting py-cpuinfo (from vllm==0.4.3)
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting transformers>=4.40.0 (from vllm==0.4.3)
  Downloading transformers-4.42.3-py3-none-any.whl.metadata (43 kB)

### Download the model files from S3 to local disk

In [4]:
import boto3
import pprint
from tqdm import tqdm
import sagemaker

# Look up the SageMaker session, region, execution role, and default S3 bucket
sagemaker_session = sagemaker.session.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [5]:
print(default_bucket)

sagemaker-us-east-1-342367142984


In [6]:
!aws s3 sync s3://{default_bucket}/llama3-8b-qlora/ ./local_model

download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/adapter_config.json to local_model/finetuned_model/adapter_config.json
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/all_results.json to local_model/finetuned_model/all_results.json
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/checkpoint-160/README.md to local_model/finetuned_model/checkpoint-160/README.md
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/checkpoint-160/scheduler.pt to local_model/finetuned_model/checkpoint-160/scheduler.pt
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/checkpoint-160/rng_state.pth to local_model/finetuned_model/checkpoint-160/rng_state.pth
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_model/README.md to local_model/finetuned_model/README.md
download: s3://sagemaker-us-east-1-342367142984/llama3-8b-qlora/finetuned_mo
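
Before pointing vLLM at the synced directory, it can help to confirm that the adapter files actually arrived. The sketch below is a hypothetical helper (`missing_adapter_files` is not part of any library) that only checks for `adapter_config.json`, the one adapter file visible in the sync log above. Any further file names you pass in `required` are your own assumption about what the training job wrote.

```python
from pathlib import Path


def missing_adapter_files(adapter_dir, required=("adapter_config.json",)):
    """Return the names in `required` that are absent from `adapter_dir`.

    Hypothetical helper: by default it only checks for adapter_config.json,
    which the sync log shows being downloaded.
    """
    root = Path(adapter_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_adapter_files("./local_model/finetuned_model")` should return an empty list once the sync has completed.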

## Load the model tokenizer

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

In [10]:
model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Load sample data for comparison

In [11]:
from datasets import load_dataset
from random import randrange
# dataset_name = "zxbsmk/webnovel_cn"
dataset_name = "hfl/ruozhiba_gpt4"
# Load the dataset from the Hub, pinned to a fixed revision for reproducibility
train_dataset = load_dataset(
    dataset_name,
    split="train",
    revision="41d2c61beb86c8d4c61916cc656c39d018c40ce5",
)

In [12]:

print(f"Training size: {len(train_dataset)}")
print("\nTraining sample:\n")
num_samples = 200
# Print one random record drawn from the first `num_samples` rows
print(train_dataset[randrange(num_samples)])

Training size: 4898

Training sample:

{'input': '', 'instruction': '屁股上不知谁他妈给我扎个窟窿。 今天才发现。吓死我了。我还有救么。？？', 'output': '首先，发现身体上有不明原因的伤口时，不要惊慌失措。你需要尽快洗净伤口、消毒并进行简单的包扎，然后及时去看医生，进行专业的检查和治疗。如果伤口不深且没有感染，在医生的指导下通常是可以治愈的。但如果伤口较深或已有感染，专业的医疗干预是必需的。康复的前提是你需要尽快就医并按照医生的建议进行护理。'}
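
The record above has the `instruction` / `input` / `output` fields, so a dataset row can be turned into the same chat-message structure that is used for inference later in this notebook. The following is a minimal sketch: `record_to_messages` is a hypothetical helper, and joining `instruction` and `input` with a newline is an assumption, not something the dataset prescribes.

```python
def record_to_messages(record, system_prompt="请始终用中文回答"):
    """Convert an instruction/input/output record into chat messages.

    Returns (messages, reference_answer). Joining instruction and input
    with a newline is an assumption made for this sketch.
    """
    user_content = record["instruction"]
    if record.get("input"):
        user_content = f"{user_content}\n{record['input']}"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]
    return messages, record["output"]
```

The returned reference answer can then be compared by eye with what the base model and the fine-tuned adapter generate for the same messages.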


In [13]:
sql_lora_path = './local_model/finetuned_model'

## Deploy locally with vLLM

In [14]:
from vllm.lora.request import LoRARequest
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer


In [15]:
model_id = 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
# enable_lora=True lets LoRA adapters be attached per request at generation time
llm = LLM(model=model_id, max_model_len=4096, enable_lora=True)

config.json:   0%|          | 0.00/885 [00:00<?, ?B/s]

INFO 07-10 16:29:57 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='TechxGenus/Meta-Llama-3-8B-Instruct-AWQ', speculative_config=None, tokenizer='TechxGenus/Meta-Llama-3-8B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TechxGenus/Meta-Llama-3-8B-Instruct-AWQ)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


generation_config.json:   0%|          | 0.00/152 [00:00<?, ?B/s]

INFO 07-10 16:29:59 weight_utils.py:207] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/4.68G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.05G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/63.5k [00:00<?, ?B/s]

INFO 07-10 16:30:13 model_runner.py:146] Loading model weights took 5.3479 GB
INFO 07-10 16:30:15 gpu_executor.py:83] # GPU blocks: 6586, # CPU blocks: 2048
INFO 07-10 16:30:18 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-10 16:30:18 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-10 16:30:26 model_runner.py:924] Graph capturing finished in 8 secs.
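
The `# GPU blocks: 6586` line in the log can be turned into a rough KV-cache capacity estimate. This is a back-of-the-envelope sketch: the block size of 16 tokens is an assumption (it is not printed in the log above), and `kv_cache_capacity` is a hypothetical helper, not a vLLM function.

```python
def kv_cache_capacity(num_gpu_blocks, block_size=16, max_model_len=4096):
    """Estimate KV-cache capacity from the number of GPU blocks.

    block_size=16 is an assumed tokens-per-block value; adjust it if your
    engine is configured differently.
    Returns (total_tokens, number_of_full_length_sequences).
    """
    total_tokens = num_gpu_blocks * block_size
    return total_tokens, total_tokens / max_model_len


tokens, full_length_seqs = kv_cache_capacity(6586)
print(tokens, round(full_length_seqs, 1))  # 105376 25.7
```

Under that assumption the cache holds roughly 105k tokens, or about 25 sequences at the full `max_model_len` of 4096.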


In [21]:
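# Test message 1 ("Who are you? What do you do?")
# messages = [
#     {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
#     {"role": "user", "content": "你是谁？你是干嘛的"},
# ]

# Test message 2 ("What should I do if a female ghost presses down on me while I sleep?")
messages = [
    {"role": "system", "content": "请始终用中文回答"},  # "Always answer in Chinese"
    {"role": "user", "content": "睡觉时被女鬼压床我该怎么办？"},
]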


inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

### Inference with the base model

In [22]:
# Low temperature keeps the output nearly deterministic; responses are capped at 512 tokens
sampling_params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=512)

outputs = llm.generate(inputs, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt:\n{prompt!r}")
    print(f"Response:\n{generated_text!r}")


Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.44s/it, Generation Speed: 66.20 toks/s]

Prompt:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n请始终用中文回答<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n睡觉时被女鬼压床我该怎么办？<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Response:
'如果你认为自己被女鬼压床，这可能是由于某种原因或 superstition（超stitious）而引起的。以下是一些可能有助的建议：\n\n1. **保持理智**：在处理这种情况时，保持理智和冷静是非常重要的。不要让恐惧和担忧控制你的思想。\n2. **检查环境**：检查你的睡眠环境是否有什么可能会导致这种情况的原因。例如，是否有某种异响或异味？\n3. **改变睡眠环境**：如果你发现某种原因，尝试改变睡眠环境。例如，移到另一个房间或使用不同的床。\n4. **寻求支持**：如果你感到非常害怕或不安全，可以寻求支持。和朋友或家人谈论你的感受，或者寻求专业人士的帮助。\n5. **实践自我保护**：如果你确实感到被女鬼压床，可以尝试一些自我保护的方法。例如，使用护身符、念经祈祷或使用某种保护仪式。\n\n需要注意的是，这些方法可能不一定能够解决问题，但它们可以帮助你感到更安全和更有控制感。\n\n最后，如果你感到非常害怕或不安全，可以考虑寻求专业人士的帮助，例如心理医生或灵媒。'
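
The prompt printed above shows the layout that `apply_chat_template` produced for this tokenizer. To make that layout explicit, here is a plain-Python sketch that rebuilds the same string from the messages. `render_llama3_prompt` is a hypothetical helper written to match the printed prompt only; the tokenizer's own template remains the source of truth and may handle cases this sketch does not.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Rebuild the chat prompt layout seen in the printed output.

    Each turn is wrapped in header tokens and closed with <|eot_id|>;
    an open assistant header is appended so the model writes the reply.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt
```

Comparing `render_llama3_prompt(messages)` with the `inputs` string is a quick sanity check that the expected chat template is in use.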





### Inference with the LoRA adapter

In [23]:
sql_lora_path = './local_model/finetuned_model'

In [24]:
# LoRARequest takes an adapter name, a unique integer ID, and the local adapter path
outputs = llm.generate(
    inputs,
    sampling_params,
    lora_request=LoRARequest("adapter", 1, sql_lora_path),
)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt:\n{prompt!r}")
    print(f"Response:\n{generated_text!r}")

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.46s/it, Generation Speed: 62.91 toks/s]

Prompt:
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n请始终用中文回答<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n睡觉时被女鬼压床我该怎么办？<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Response:
'睡觉时被女鬼压床这种情况在传统文化中被称为“被压梦”，是一种常见的梦境现象。这种现象通常是由于梦境过于激动人心、情绪过高或是心理压力大的原因。为了避免这种情况，可以通过调整睡眠环境、减少压力、进行放松技巧和理性思考来预防。'
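
Since the base model and the adapter are run on the same prompt, the two calls can be wrapped into one comparison step. The sketch below assumes only what the cells above already use: an `llm` object whose `generate` accepts an optional `lora_request` keyword and returns results with `.outputs[0].text`. `compare_base_and_lora` is a hypothetical helper name.

```python
def compare_base_and_lora(llm, prompt, sampling_params, lora_request):
    """Generate once without and once with the adapter, and return both texts."""
    base = llm.generate(prompt, sampling_params)
    tuned = llm.generate(prompt, sampling_params, lora_request=lora_request)
    return {
        "base": base[0].outputs[0].text,
        "lora": tuned[0].outputs[0].text,
    }
```

In this notebook it could be called as `compare_base_and_lora(llm, inputs, sampling_params, LoRARequest("adapter", 1, sql_lora_path))`, which returns both responses in one dictionary for side-by-side reading.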



