
Hope for improved service stability #1089

Closed
yinghaodang opened this issue Mar 6, 2024 · 3 comments


yinghaodang commented Mar 6, 2024

I deployed Xinference, version 0.9.0.
As soon as a request asks for too much GPU memory, the service hangs.
Symptoms: refreshing the UI shows no models at all, every running model exits abnormally, and nothing recovers automatically.
The log says: Out of memory.

Current workaround: kill the container and redeploy.

The service I have in mind would, when its own load gets too high, work through the in-flight tasks in an orderly way and temporarily reject extra requests, so that the service stays available.
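For illustration only, a minimal sketch of that admission-control idea in a FastAPI-style server (hypothetical code, not Xinference's implementation; the endpoint, the limit, and the run_inference helper are all assumptions): once every slot is busy, new requests get an immediate 503 instead of piling onto the GPU.

import asyncio
from fastapi import FastAPI, HTTPException

app = FastAPI()
MAX_CONCURRENT = 4                        # assumed capacity; tune to the GPU budget
slots = asyncio.Semaphore(MAX_CONCURRENT)

@app.post("/v1/chat/completions")
async def chat(payload: dict):
    # Shed load instead of queueing: if every slot is taken, answer 503 right away.
    if slots.locked():
        raise HTTPException(status_code=503, detail="server overloaded, retry later")
    async with slots:
        return await run_inference(payload)

async def run_inference(payload: dict) -> dict:
    # Hypothetical stand-in for the real model call.
    await asyncio.sleep(0.1)
    return {"choices": []}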

XprobeBot added this to the v0.9.2 milestone Mar 6, 2024
qinxuye (Contributor) commented Mar 6, 2024

Could you paste the complete log?

XprobeBot modified the milestones: v0.9.2, v0.9.3 Mar 8, 2024
yinghaodang (Author) commented

Let me retrace my steps so the issue is easier to reproduce.

version: '3.8'

services:
  xinference-local:
    image: xprobe/xinference:v0.9.0
    container_name: xinference-local
    ports:
      - 9998:9997
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/root/MODEL_PATH
    volumes:
      - ${MODEL_PATH}:/root/MODEL_PATH
    restart: always
    shm_size: '128g'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: xinference-local -H 0.0.0.0 --log-level debug
    networks:
      - xinference-local
networks:
  xinference-local:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "172.30.2.0/24"
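Assuming Docker Compose v2 and that MODEL_PATH is exported in the shell, this stack would be started with something like MODEL_PATH=/data/models docker compose up -d (the path here is a placeholder, not the reporter's actual directory).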

${MODEL_PATH} is a directory whose subdirectories are the individual model directories.
My GPUs are four NVIDIA A100s. On the deployment page I deployed the 72B int4-quantized Qwen (a model downloaded through the Xinference platform itself). I also tested the Baichuan2-7b-chat model the same way; in both cases the model has exclusive use of all four cards, and both are PyTorch-format models.
Below is the code I used to exercise the LLM:

from ragas import evaluate

from langchain.chat_models import ChatOpenAI
from ragas.llms.base import LangchainLLMWrapper

inference_server_url = ""  # left blank in the original report; the Xinference OpenAI-compatible base URL goes here

# create a LangChain ChatOpenAI instance pointed at the OpenAI-compatible inference server
chat = ChatOpenAI(
    model="Baichuan2-13B-Chat",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=1024,
    temperature=0.5,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLMWrapper(chat)

from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    context_relevancy,
    answer_correctness,
    answer_similarity,
)
from ragas.metrics.critique import harmfulness

# change the LLM

faithfulness.llm = vllm
context_relevancy.llm = vllm
context_recall.llm = vllm
context_precision.llm = vllm
answer_similarity.llm = vllm
answer_relevancy.llm = vllm # Invalid key: 0 is out of bounds for size 0
answer_correctness.llm = vllm
harmfulness.llm = vllm

from langchain.embeddings import HuggingFaceEmbeddings

modelPath = "bge-large-zh"
# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}
# Create a dictionary with encoding options, setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}
# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,       # Provide the pre-trained model's path
    model_kwargs=model_kwargs,  # Pass the model configuration options
    encode_kwargs=encode_kwargs # Pass the encoding options
)

answer_relevancy.embeddings = embeddings
answer_correctness.embeddings = embeddings
answer_similarity.embeddings = embeddings

# evaluate (`evaluate` is already imported at the top; `dataset` is the
# ~100-record evaluation set described below, prepared elsewhere)
result = evaluate(
    dataset,
    metrics=[context_precision], # 1
)

print(result)
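For completeness: `dataset` never gets defined in the snippet above. A minimal sketch of what it might look like, assuming the `datasets` library and the column names that ragas metrics such as context_precision typically expect (the records themselves are hypothetical):

from datasets import Dataset

# Hypothetical stand-in for the ~100-record evaluation set mentioned below.
dataset = Dataset.from_dict({
    "question": ["What GPUs were used for the deployment?"],
    "answer": ["Four NVIDIA A100 cards."],
    "contexts": [["The deployment ran on 4 x NVIDIA A100 GPUs."]],
    "ground_truths": [["4 NVIDIA A100 GPUs."]],
})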

That's roughly what the test code looks like; I edited it back and forth many times, so it may contain bugs. The core call into the LLM is really just the last statement, result = evaluate(dataset, metrics=[context_precision]).
When I tested, the dataset only had about 100 records. Watching nvidia-smi in the background, you can see GPU memory usage climb steadily until the memory is full, at which point usage drops to 0.
The backend log is just a plain Out of memory. (The cards are busy with other work right now; I'll paste the log once they are free.)
The web frontend then hangs: I can neither see the running models nor redeploy a model.
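For reference, the memory climb the reporter watched in nvidia-smi could also be logged programmatically; a small sketch assuming the pynvml package (an illustration, not part of the original report):

import time
import pynvml

# Poll per-GPU memory once per second; in the reported failure the numbers
# climb steadily, then drop to 0 after the Out of memory.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        used_mib = [pynvml.nvmlDeviceGetMemoryInfo(h).used // 2**20 for h in handles]
        print("used MiB per GPU:", used_mib)
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()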

XprobeBot modified the milestones: v0.9.3, v0.9.4, v0.9.5 Mar 15, 2024
oubeichen commented

So how was this resolved in the end?
