
Hope for improved service stability #1089

Closed
yinghaodang opened this issue Mar 6, 2024 · 3 comments


yinghaodang commented Mar 6, 2024

I deployed Xinference, version 0.9.0.
As soon as a request asks for too much GPU memory, the service hangs.
Symptoms: refreshing the UI shows no models at all, every running model exits abnormally, and nothing recovers automatically.
The log says: Out of memory.

Current workaround: kill the container and redeploy.

The service I have in mind would, when its own load gets too high, work through the in-flight tasks in an orderly way and temporarily reject extra requests, so that the service stays available.
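For illustration only, a minimal sketch of that admission-control idea in a FastAPI-style server (hypothetical code, not Xinference's implementation; the endpoint, the limit, and the run_inference helper are all assumptions): once every slot is busy, new requests get an immediate 503 instead of piling onto the GPU.

import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_CONCURRENT = 4                        # assumed capacity; tune to the GPU budget
slots = asyncio.Semaphore(MAX_CONCURRENT)

@app.post("/v1/chat/completions")
async def chat(payload: dict):
    # Shed load instead of queueing: if every slot is taken, answer 503 right away.
    if slots.locked():
        raise HTTPException(status_code=503, detail="server overloaded, retry later")
    async with slots:
        return await run_inference(payload)

async def run_inference(payload: dict) -> dict:
    # Hypothetical stand-in for the real model call.
    await asyncio.sleep(0.1)
    return {"choices": []}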

XprobeBot added this to the v0.9.2 milestone Mar 6, 2024
qinxuye (Contributor) commented Mar 6, 2024

Could you paste the complete log?

XprobeBot modified the milestones: v0.9.2, v0.9.3 Mar 8, 2024
yinghaodang (Author) commented

Let me retrace my steps so the issue is easier to reproduce.

version: '3.8'

services:
  xinference-local:
    image: xprobe/xinference:v0.9.0
    container_name: xinference-local
    ports:
      - 9998:9997
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/root/MODEL_PATH
    volumes:
      - ${MODEL_PATH}:/root/MODEL_PATH
    restart: always
    shm_size: '128g'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: xinference-local -H 0.0.0.0 --log-level debug
    networks:
      - xinference-local
networks:
  xinference-local:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "172.30.2.0/24"
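Assuming Docker Compose v2 and that MODEL_PATH is exported in the shell, this stack would be started with something like MODEL_PATH=/data/models docker compose up -d (the path here is a placeholder, not the reporter's actual directory).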

${MODEL_PATH} is a directory whose subdirectories are the individual model directories.
My GPUs are four NVIDIA A100s. On the deployment page I deployed the 72B int4-quantized Qwen (a model downloaded through the Xinference platform itself). I also tested the Baichuan2-7b-chat model the same way; in both cases the model has exclusive use of all four cards, and both are PyTorch-format models.
Below is the code I used to exercise the LLM:

from ragas import evaluate

from langchain.chat_models import ChatOpenAI
from ragas.llms.base import LangchainLLMWrapper

inference_server_url = ""  # left blank in the original report; the Xinference OpenAI-compatible base URL goes here

# create a LangChain ChatOpenAI instance pointed at the OpenAI-compatible inference server
chat = ChatOpenAI(
    model="Baichuan2-13B-Chat",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=1024,
    temperature=0.5,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLMWrapper(chat)

from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    context_relevancy,
    answer_correctness,
    answer_similarity,
)
from ragas.metrics.critique import harmfulness

# change the LLM

faithfulness.llm = vllm
context_relevancy.llm = vllm
context_recall.llm = vllm
context_precision.llm = vllm
answer_similarity.llm = vllm
answer_relevancy.llm = vllm # Invalid key: 0 is out of bounds for size 0
answer_correctness.llm = vllm
harmfulness.llm = vllm

from langchain.embeddings import HuggingFaceEmbeddings

modelPath = "bge-large-zh"
# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}
# Create a dictionary with encoding options, setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}
# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,       # Provide the pre-trained model's path
    model_kwargs=model_kwargs,  # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

answer_relevancy.embeddings = embeddings
answer_correctness.embeddings = embeddings
answer_similarity.embeddings = embeddings

# evaluate (`evaluate` is already imported at the top; `dataset` is the
# ~100-record evaluation set described below, prepared elsewhere)
result = evaluate(
    dataset,
    metrics=[context_precision], # 1
)

print(result)
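For completeness: `dataset` never gets defined in the snippet above. A minimal sketch of what it might look like, assuming the `datasets` library and the column names that ragas metrics such as context_precision typically expect (the records themselves are hypothetical):

from datasets import Dataset

# Hypothetical stand-in for the ~100-record evaluation set mentioned below.
dataset = Dataset.from_dict({
    "question": ["What GPUs were used for the deployment?"],
    "answer": ["Four NVIDIA A100 cards."],
    "contexts": [["The deployment ran on 4 x NVIDIA A100 GPUs."]],
    "ground_truths": [["4 NVIDIA A100 GPUs."]],
})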

That's roughly what the test code looks like; I edited it back and forth many times, so it may contain bugs. The core call into the LLM is really just the last statement, result = evaluate(dataset, metrics=[context_precision]).
When I tested, the dataset only had about 100 records. Watching nvidia-smi in the background, you can see GPU memory usage climb steadily until the memory is full, at which point usage drops to 0.
The backend log is just a plain Out of memory. (The cards are busy with other work right now; I'll paste the log once they are free.)
The web frontend then hangs: I can neither see the running models nor redeploy a model.
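For reference, the memory climb the reporter watched in nvidia-smi could also be logged programmatically; a small sketch assuming the pynvml package (an illustration, not part of the original report):

import time
import pynvml

# Poll per-GPU memory once per second; in the reported failure the numbers
# climb steadily, then drop to 0 after the Out of memory.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        used_mib = [pynvml.nvmlDeviceGetMemoryInfo(h).used // 2**20 for h in handles]
        print("used MiB per GPU:", used_mib)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()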

XprobeBot modified the milestones: v0.9.3, v0.9.4, v0.9.5 Mar 15, 2024
oubeichen commented

So how was this resolved in the end?
