
Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs #3070

Merged: 1 commit merged into vllm-project:main on Feb 28, 2024

Conversation

AllenDou
Contributor

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/root/vllm/vllm/__init__.py", line 4, in <module>
    from vllm.engine.async_llm_engine import AsyncLLMEngine
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 10, in <module>
    from vllm.engine.llm_engine import LLMEngine
  File "/root/vllm/vllm/engine/llm_engine.py", line 14, in <module>
    from vllm.engine.metrics import StatLogger, Stats
  File "/root/vllm/vllm/engine/metrics.py", line 2, in <module>
    from prometheus_client import Counter, Gauge, Histogram, REGISTRY, disable_created_metrics
ImportError: cannot import name 'disable_created_metrics' from 'prometheus_client' (/usr/local/lib/python3.10/dist-packages/prometheus_client/__init__.py)

restrict prometheus_client to >= 0.18.0 to prevent errors when importing pkgs
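
The ImportError above happens because disable_created_metrics only exists in prometheus_client 0.18.0 and newer, which is what this PR pins. For illustration only (this is not the PR's change, just a minimal sketch of a guarded import for environments where the package may be older):

# Minimal sketch, assuming prometheus_client might be older than 0.18.0:
# disable_created_metrics() was introduced in 0.18.0, so guard the import.
try:
    from prometheus_client import disable_created_metrics
    disable_created_metrics()  # suppress the *_created helper series
except ImportError:
    # prometheus_client < 0.18.0: the helper does not exist; either upgrade
    # the package (as this PR requires) or keep the *_created samples.
    pass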

@simon-mo simon-mo enabled auto-merge (squash) February 28, 2024 04:14
@simon-mo simon-mo merged commit e46fa5d into vllm-project:main Feb 28, 2024
22 checks passed
@AllenDou AllenDou deleted the requirement-limit branch February 29, 2024 06:42
xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024
@grandiose-pizza
Contributor

Hi @AllenDou,

I want to use the metrics. I have exposed an API using api_server.py.

When I request http://localhost:8075/metrics/, I get the following output instead of the values described in the Metrics class. How can I see those metrics?

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

@AllenDou
Contributor Author

AllenDou commented Apr 8, 2024

@grandiose-pizza Show me your startup command.

@grandiose-pizza
Contributor

My entrypoint file:

#!/bin/bash

# Load the model
# cd /app/app
# uvicorn app.main:app --reload --workers 1 --host 0.0.0.0 --port 8075
MODEL_VERSION="${MODEL_NAME}_${MODEL_TYPE}_${TRAINING_TYPE}_${MODEL_PHASE}_${CONTEXT_LENGTH}_${DATASET_VERSION}_${QUANTIZATION_TYPE}_${INFERENCE_MODE}"
MODEL_FOLDER_PATH="/app/models/$MODEL_DIRECTORY/$MODEL_VERSION"

python -m app.routes.text_generation_vllm --model $MODEL_FOLDER_PATH --trust-remote-code --tensor-parallel-size $((TENSOR_PARALLEL_SIZE)) --served-model-name $SERVED_MODEL_NAME --api-key $MODEL_API_KEY --port $((INFERENCE_PORT)) --worker-use-ray --engine-use-ray --disable-log-requests

It's loading a model from disk at the $MODEL_FOLDER_PATH location.
$((TENSOR_PARALLEL_SIZE)) = 2
$((INFERENCE_PORT)) = 8075

app.routes.text_generation_vllm is the same as api_server.py, except that there is no chat template; instead I use a custom build_prompt function, inheriting a few classes such as ChatCompletionRequest and OpenAIServingChat.

@AllenDou
Contributor Author

AllenDou commented Apr 8, 2024

@grandiose-pizza Have you registered the vLLM-related metrics with the Prometheus registry? Please refer to the Metrics class, e.g. self.gauge_scheduler_running = Gauge(...).
When /metrics/ is requested, the async def prometheus_app function generates the Prometheus output; refer to the def make_asgi_app() function.
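
For reference, a minimal sketch of that flow (the metric name is approximated from vLLM's Metrics class; the rest is illustrative wiring, not vLLM's actual code): Gauges and Counters register themselves against prometheus_client's default REGISTRY when they are constructed, and make_asgi_app() serves everything in that registry.

from prometheus_client import Gauge, make_asgi_app
import fastapi

# Constructing the gauge registers it with the default REGISTRY.
gauge_scheduler_running = Gauge(
    "vllm:num_requests_running",  # name approximated from vLLM's Metrics class
    "Number of requests currently running on GPU.")

app = fastapi.FastAPI()
app.mount("/metrics", make_asgi_app())  # serves every collector in REGISTRY

# Anything the stat logger sets shows up on the next scrape of /metrics.
gauge_scheduler_running.set(3)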

@grandiose-pizza
Contributor

I didn't change anything in the make_asgi_app() function; it's using the defaults provided by vLLM. Do I need to add something specific if I am using a new model? I am using the Jais models.

@AllenDou
Contributor Author

AllenDou commented Apr 8, 2024

@grandiose-pizza The model type is irrelevant to the metrics; Jais is fine. Can I see your code?

@grandiose-pizza
Contributor

grandiose-pizza commented Apr 8, 2024

I have folded my imports into this single file/code block, as I can't upload the files.

I am using vllm==0.4.0.

As you can see, the only change is that I use build_prompt (which in turn calls an API to formulate the prompt) instead of tokenizer.apply_chat_template, and therefore had to inherit a few classes (ChatCompletionRequest and OpenAIServingChat) to make that change.

import asyncio
from contextlib import asynccontextmanager
import os
import importlib
import inspect

from prometheus_client import make_asgi_app
import fastapi
import uvicorn
from http import HTTPStatus
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, StreamingResponse, Response

import vllm
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.protocol import (CompletionRequest,
                                              ChatCompletionRequest,
                                              ErrorResponse,
                                              ChatCompletionResponse,
                                            )
from vllm.logger import init_logger
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from typing import AsyncGenerator, AsyncIterator, Optional, List, Union
from vllm.entrypoints.openai.serving_engine import LoRA
from vllm.utils import random_uuid
from vllm.model_executor.guided_decoding import (
    get_guided_decoding_logits_processor)
from pydantic import Field
from sse_starlette.sse import ServerSentEvent, EventSourceResponse
from vllm.usage.usage_lib import UsageContext
import requests
import json
from transformers import AutoTokenizer


TIMEOUT_KEEP_ALIVE = 5  # seconds

openai_serving_chat: OpenAIServingChat = None
openai_serving_completion: OpenAIServingCompletion = None
logger = init_logger(__name__)


INFERENCE_MODE = os.environ.get('INFERENCE_MODE')
MODEL_DIRECTORY = os.environ.get('MODEL_DIRECTORY')
TEMPERATURE = float(os.environ.get('TEMPERATURE'))
TOP_P = float(os.environ.get('TOP_P'))
NO_REPEAT_NGRAM = int(os.environ.get('NO_REPEAT_NGRAM'))
REPITION_PENALTY = float(os.environ.get('REPITION_PENALTY'))
MAX_LENGTH = int(os.environ.get('MAX_LENGTH'))
MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS'))
MIN_LENGTH = int(os.environ.get('MIN_LENGTH'))
DO_SAMPLE = bool(os.environ.get('DO_SAMPLE'))
MODEL_TYPE = os.environ.get('MODEL_TYPE')
MODEL_NAME = os.environ.get('MODEL_NAME')
STREAM = bool(os.environ.get('STREAM'))
TRAINING_TYPE = os.environ.get('TRAINING_TYPE')
MODEL_PHASE = os.environ.get('MODEL_PHASE')
CONTEXT_LENGTH = os.environ.get('CONTEXT_LENGTH')
DATASET_VERSION = os.environ.get('DATASET_VERSION')
QUANTIZATION_TYPE = os.environ.get('QUANTIZATION_TYPE')
PROMPTER_URL = os.environ.get('PROMPTER_URL')
SERVED_MODEL_NAME = os.environ.get('SERVED_MODEL_NAME')
MODEL_API_KEY = os.environ.get('MODEL_API_KEY')
INFERENCE_PORT = int(os.environ.get('INFERENCE_PORT'))
MODEL_VERSION = '_'.join([MODEL_NAME, MODEL_TYPE, TRAINING_TYPE,
                          MODEL_PHASE, CONTEXT_LENGTH, DATASET_VERSION,
                          QUANTIZATION_TYPE, INFERENCE_MODE])

# Model path and tokenizer are derived from the environment-driven settings above.
MODEL_FOLDER_PATH = f"/app/models/{MODEL_DIRECTORY}/{MODEL_VERSION}"
tokenizer = AutoTokenizer.from_pretrained(MODEL_FOLDER_PATH, trust_remote_code=True)

def is_system_present(messages):
    for message in messages:
        if message['role'] == "system":
            return True
    return False

def validate_messages(messages):
    if not isinstance(messages, list):
        raise ValueError("messages is not a list")
    if messages == []:
        raise ValueError("messages list is empty.")
    for message in messages:
        if not all(key in message for key in ['role', 'content']):
            raise ValueError("Invalid messages format")

def validate_prompt(messages):
    additional_tokens_placeholder = 700 # tags + default prompt + answer

    validate_messages(messages)

    if not is_system_present(messages):
        messages=[{"role": "system", "content": ""}]+messages

    if messages[-1]['role']!='user':
        messages.append({"role": "user", "content": "Hi there!"})

    if len(messages) == 2:
        messages = messages[:1] + [{"role": "user", "content": ""}] + messages[1:]

    assert messages[0]['role']=='system', "Invalid messages format"
    assert messages[-1]['role']=='user', "Invalid messages format"


    UQ = len(tokenizer(messages[-1]['content']).input_ids)
    if UQ > MAX_LENGTH-additional_tokens_placeholder:
        raise ValueError("Query is too long")


    if messages[1]['role']=='user' and messages[2]['role']=='user':
        RC = len(tokenizer(messages[1]['content']).input_ids)
        if RC >= MAX_LENGTH-UQ-additional_tokens_placeholder:
            raise ValueError("Query with RAG context is too long")  
    else:
        messages = messages[:1] + [{"role": "user", "content": ""}] + messages[1:]
        RC = 0

    sys_prompt_context = messages[:2]
    chat_history_query = messages[2:]

    assert chat_history_query[0]["role"]=="user", "Invalid messages format"

    chat_history_query_len = len(
        tokenizer(
            " ".join([k['content'] for k in chat_history_query])).input_ids)
    
    if len(chat_history_query)>1:
        while chat_history_query_len >= MAX_LENGTH-UQ-RC-additional_tokens_placeholder:
            chat_history_query = chat_history_query[2:]
            chat_history_query_len = len(
                tokenizer(
                    " ".join([k['content'] for k in chat_history_query])).input_ids)

    messages_new = sys_prompt_context + chat_history_query
    return messages_new


def build_prompt(messages):
    messages = validate_prompt(messages)
    sys_prompt = messages[0]['content']
    context = messages[1]['content']
    chat_history = messages[2:-1]
    user_query = messages[-1]['content']  # validate_prompt guarantees the last message is the user query
    payload = json.dumps({
        "sys_prompt": sys_prompt,
        "context": context,
        "user_query": user_query,
        "chat_history": chat_history})
    headers = {'Content-Type': 'application/json'}

    response = requests.request("POST", PROMPTER_URL, headers=headers, data=payload)
    if response.status_code == 200:
        return response.json()["formulated_prompt"]
    else:
        return None

class JaisChatCompletionRequest(ChatCompletionRequest):
    temperature: Optional[float] = TEMPERATURE
    top_p: Optional[float] = TOP_P
    repetition_penalty: Optional[float] = REPITION_PENALTY
    add_generation_prompt: Optional[bool] = Field(
        default=False,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )

class JaisServingChat(OpenAIServingChat):

    def __init__(self,
                 engine: AsyncLLMEngine,
                 served_model: str,
                 response_role: str,
                 lora_modules: Optional[List[LoRA]] = None,
                 chat_template=None):
        super().__init__(engine=engine,
                         served_model=served_model,
                         response_role=response_role,
                         lora_modules=lora_modules)
        self.response_role = response_role
        self._load_chat_template(chat_template)

    async def create_chat_completion(
        self, request: JaisChatCompletionRequest, raw_request: Request
    ) -> Union[ErrorResponse, AsyncGenerator[str, None],
               ChatCompletionResponse]:
        """Completion API similar to OpenAI's API.

        See https://platform.openai.com/docs/api-reference/chat/create
        for the API specification. This API mimics the OpenAI
        ChatCompletion API.

        NOTE: Currently we do not support the following feature:
            - function_call (Users should implement this by themselves)
        """
        error_check_ret = await self._check_model(request)
        if error_check_ret is not None:
            return error_check_ret

        try:
            prompt = build_prompt(request.messages)
        except Exception as e:
            logger.error(
                f"Error in applying chat template from request: {str(e)}")
            return self.create_error_response(str(e))

        request_id = f"cmpl-{random_uuid()}"
        try:
            token_ids = self._validate_prompt_and_tokenize(request,
                                                           prompt=prompt)
            sampling_params = request.to_sampling_params()
            lora_request = self._maybe_get_lora(request)
            guided_decode_logits_processor = (
                await get_guided_decoding_logits_processor(
                    request, await self.engine.get_tokenizer()))
            if guided_decode_logits_processor:
                if sampling_params.logits_processors is None:
                    sampling_params.logits_processors = []
                sampling_params.logits_processors.append(
                    guided_decode_logits_processor)
        except ValueError as e:
            return self.create_error_response(str(e))

        result_generator = self.engine.generate(prompt, sampling_params,
                                                request_id, token_ids,
                                                lora_request)
        # Streaming response
        if request.stream:
            return self.chat_completion_stream_generator(
                request, result_generator, request_id)
        else:
            try:
                return await self.chat_completion_full_generator(
                    request, raw_request, result_generator, request_id)
            except ValueError as e:
                # TODO: Use a vllm-specific Validation Error
                return self.create_error_response(str(e))


@asynccontextmanager
async def lifespan(app: fastapi.FastAPI):

    async def _force_log():
        while True:
            await asyncio.sleep(10)
            await engine.do_log_stats()

    if not engine_args.disable_log_stats:
        asyncio.create_task(_force_log())

    yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
    parser = make_arg_parser()
    return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
    err = openai_serving_chat.create_error_response(message=str(exc))
    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
    """Health check."""
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
    models = await openai_serving_chat.show_available_models()
    return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
    ver = {"version": vllm.__version__}
    return JSONResponse(content=ver)

async def yield_gen(gen):
    async for val in gen:
        yield(ServerSentEvent(val))


@app.post("/v1/chat/completions")
async def create_chat_completion(request: JaisChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/chat/completions_sse")
async def create_chat_completion_sse(request: JaisChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return EventSourceResponse(yield_gen(generator))
    else:
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions_sse")
async def create_completion_sse(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return EventSourceResponse(yield_gen(generator))
    else:
        return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
    args = parse_args()

    app.add_middleware(
        CORSMiddleware,
        allow_origins=args.allowed_origins,
        allow_credentials=args.allow_credentials,
        allow_methods=args.allowed_methods,
        allow_headers=args.allowed_headers,
    )

    if token := os.environ.get("VLLM_API_KEY") or args.api_key:

        @app.middleware("http")
        async def authentication(request: Request, call_next):
            if not request.url.path.startswith("/v1"):
                return await call_next(request)
            if request.headers.get("Authorization") != "Bearer " + token:
                return JSONResponse(content={"error": "Unauthorized"},
                                    status_code=401)
            return await call_next(request)

    for middleware in args.middleware:
        module_path, object_name = middleware.rsplit(".", 1)
        imported = getattr(importlib.import_module(module_path), object_name)
        if inspect.isclass(imported):
            app.add_middleware(imported)
        elif inspect.iscoroutinefunction(imported):
            app.middleware("http")(imported)
        else:
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

    logger.info(f"vLLM API server version {vllm.__version__}")
    logger.info(f"args: {args}")

    if args.served_model_name is not None:
        served_model = args.served_model_name
    else:
        served_model = args.model

    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
    openai_serving_chat = JaisServingChat(engine, served_model,
                                            args.response_role,
                                            args.lora_modules,
                                            args.chat_template)
    openai_serving_completion = OpenAIServingCompletion(
        engine, served_model, args.lora_modules)

    app.root_path = args.root_path
    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level=args.uvicorn_log_level,
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
                ssl_keyfile=args.ssl_keyfile,
                ssl_certfile=args.ssl_certfile,
                ssl_ca_certs=args.ssl_ca_certs,
                ssl_cert_reqs=args.ssl_cert_reqs)

@AllenDou
Contributor Author

AllenDou commented Apr 9, 2024

@grandiose-pizza The problem is '--engine-use-ray'; let me find out why.
Meanwhile, you can remove '--engine-use-ray' and try again.
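
A plausible explanation (my assumption, not confirmed in this thread): prometheus_client registries are per-process, so when '--engine-use-ray' moves the engine into a separate Ray worker process, the Gauges it creates never land in the API server's REGISTRY, and /metrics only shows the default python_*/process_* collectors. A small standalone illustration of that per-process behavior:

import multiprocessing as mp

from prometheus_client import Counter, REGISTRY

def create_metric_in_child():
    # Registered only in the child process's copy of the default registry.
    Counter("child_only_total", "Counter created in a separate process").inc()

if __name__ == "__main__":
    p = mp.Process(target=create_metric_in_child)
    p.start()
    p.join()
    names = {m.name for m in REGISTRY.collect()}
    # Prints False: the parent's registry never sees the child's counter,
    # just as the API server never sees metrics created inside a Ray actor.
    print("child_only" in names)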

@grandiose-pizza
Contributor

Hi @AllenDou, this worked. Thanks a ton. Is this a bug? Should we create a PR to fix this for '--engine-use-ray'?

I am not sure how much of a performance difference it makes with and without this parameter.

@AllenDou
Contributor Author

AllenDou commented Apr 9, 2024

@grandiose-pizza FYI #3938
