## vLLM-LMI Mixtral-8x7B-DPO-AWQ deployment guide

### In this tutorial, you will use the vLLM backend of the Large Model Inference (LMI) DLC to deploy Mixtral-8x7B-DPO-AWQ and run inference with it.

Make sure the following permissions are granted before running the notebook:

* S3 bucket push access
* SageMaker access




### Step 1: Let's upgrade the SageMaker SDK and import the required modules

In [1]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install transformers sentencepiece --upgrade  --quiet


In [2]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

### Step 2: Start preparing model artifacts

The LMI container expects the following artifacts to help set up the model:

* serving.properties (required): Defines the model server settings
* model.py (optional): A Python file that defines the core inference logic
* requirements.txt (optional): Any additional pip packages that need to be installed

In [8]:
%%writefile serving.properties
engine=Python
option.model_id=TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.rolling_batch=vllm
option.task=text-generation
option.dtype=fp16
option.quantize=awq
option.max_model_len=8192

Writing serving.properties
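The properties file above is a flat list of `key=value` pairs. As a minimal sketch, the same content could be generated from a Python dict, which makes it easier to vary settings such as `option.tensor_parallel_degree` programmatically. The helper names `render_properties` and `parse_properties` are hypothetical and are not part of LMI or the SageMaker SDK.

```python
def render_properties(settings: dict) -> str:
    """Render a dict as key=value lines, one per setting."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


# Same values as the serving.properties cell above.
settings = {
    "engine": "Python",
    "option.model_id": "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ",
    "option.tensor_parallel_degree": 4,
    "option.max_rolling_batch_size": 8,
    "option.rolling_batch": "vllm",
    "option.task": "text-generation",
    "option.dtype": "fp16",
    "option.quantize": "awq",
    "option.max_model_len": 8192,
}

properties_text = render_properties(settings)
round_trip = parse_properties(properties_text)
```

Writing `properties_text` to a file named `serving.properties` would give the same artifact as the `%%writefile` cell.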


In [9]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
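The shell cell above packs `serving.properties` under a top-level `mymodel/` folder. A rough Python equivalent using the standard-library `tarfile` module is sketched below; the function name `package_model` is hypothetical, and the example works in a temporary directory so it leaves nothing behind.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(properties_text: str, workdir: Path,
                  archive_name: str = "mymodel.tar.gz") -> Path:
    """Create <workdir>/<archive_name> containing mymodel/serving.properties."""
    model_dir = workdir / "mymodel"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)

    archive_path = workdir / archive_name
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level "mymodel/" folder, as in the shell version.
        tar.add(model_dir, arcname="mymodel")
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    archive = package_model("engine=Python\noption.rolling_batch=vllm\n", Path(tmp))
    # List the archive contents to confirm the layout.
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
```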


### Step 3: Start building the SageMaker endpoint

#### Getting the container image URI

In [10]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.26.0",
)

#### Upload the artifact to S3 and create the SageMaker model

In [26]:
model_name = "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ"
s3_code_prefix = f"large-model-vllm/{model_name}code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-596899493901/large-model-vllm/TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQcode/mymodel.tar.gz


#### Create the SageMaker endpoint with a specified instance type

In [27]:
instance_type = "ml.g4dn.12xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_name.replace('/', '-')}")
print(f"endpoint_name: {endpoint_name}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=1800
            )

# Requests are sent as JSON, so we specify a JSON serializer.
# No deserializer is set, so predict() returns the raw response bytes.
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

endpoint_name: lmi-model-TheBloke-Nous-Hermes-2-Mixtra-2024-03-21-03-25-49-158
---------------!
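The `option.tensor_parallel_degree=4` setting in `serving.properties` shards the model across four GPUs, so the chosen instance needs at least that many; `ml.g4dn.12xlarge` provides four NVIDIA T4 GPUs. A minimal pre-deployment check could look like the sketch below. It assumes a hand-maintained lookup table, and both the table and the helper `check_tensor_parallel_degree` are hypothetical, not part of the SageMaker SDK.

```python
# GPU counts per instance type; extend this table for other instance types.
GPUS_PER_INSTANCE = {
    "ml.g4dn.12xlarge": 4,  # 4 x NVIDIA T4
    "ml.g5.12xlarge": 4,    # 4 x NVIDIA A10G
    "ml.g5.48xlarge": 8,    # 8 x NVIDIA A10G
}


def check_tensor_parallel_degree(instance_type: str, degree: int) -> None:
    """Raise ValueError if the instance cannot satisfy the requested degree."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(
            f"Unknown GPU count for {instance_type}; add it to GPUS_PER_INSTANCE"
        )
    gpus = GPUS_PER_INSTANCE[instance_type]
    if degree > gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} exceeds the {gpus} GPUs "
            f"available on {instance_type}"
        )


# The combination used in this notebook passes the check.
check_tensor_parallel_degree("ml.g4dn.12xlarge", 4)
```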

### Step 4: Run inference

In [28]:
system_message=""
input_text = "请解释一下AI"  # Chinese for "Please explain AI"
prompt_template=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{input_text}<|im_end|>
<|im_start|>assistant
'''
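The template above follows the `<|im_start|>` / `<|im_end|>` chat layout (commonly called ChatML), ending with an open assistant turn for the model to complete. As a small sketch, the same string can be produced by a reusable helper; the function name `build_chatml_prompt` is hypothetical.

```python
def build_chatml_prompt(user_message: str, system_message: str = "") -> str:
    """Build a single-turn prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("Explain AI in one paragraph.")
```

The result can be passed as the `"inputs"` value of the request payload in place of `prompt_template`.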

In [29]:
%%time
response = predictor.predict(
    {
        "inputs": prompt_template,
        "parameters": {
            "max_new_tokens": 128,
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.95,
            "top_k": 40,
            "repetition_penalty": 1.1,
        },
    }
)
text = str(response, "utf-8")
text

CPU times: user 15.4 ms, sys: 0 ns, total: 15.4 ms
Wall time: 6.5 s


'{"generated_text": "AI，全称为人工智能（Artificial Intelligence），是指计算机系统通过学习、推理和自我改进的能力模拟人类智能行为和思维过程。这意味着它可以执行复杂的任务并从经验中学习并提高其性能。AI主要分为强AI（AGI，Artificial General Intelligence）和弱AI（ANI，Artificial Narrow Intelligence）两大类别。弱AI具有特定领域或特定"}'
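The endpoint returns a JSON document as raw bytes, with the completion under the `generated_text` key, as the output above shows. A minimal sketch for extracting just the text follows; the helper name `extract_generated_text` and the sample payload are made up for illustration and are not actual endpoint output.

```python
import json


def extract_generated_text(raw_response: bytes) -> str:
    """Decode the response bytes and return the 'generated_text' field."""
    payload = json.loads(raw_response.decode("utf-8"))
    return payload["generated_text"]


# Illustrative payload in the same shape as the endpoint response.
sample_response = b'{"generated_text": "AI stands for artificial intelligence."}'
generated = extract_generated_text(sample_response)
```

In the notebook, `extract_generated_text(response)` would take the place of `str(response, "utf-8")`.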

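### Step 5: Clean up

Delete the endpoint, its configuration, and the model once you are done, so the instance stops incurring charges.

In [24]: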
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()