
Support for vLLM chat completion endpoint #778

@noobHappylife

Description

Currently, I noticed that the HFClientVLLM module only calls the completions endpoint. Since vLLM now supports applying chat templates, would it be possible to add support for the chat completions endpoint?
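
For reference, a vLLM server launched with the OpenAI-compatible API serves both endpoints; the chat one applies the model's chat template to the messages server-side. A minimal sketch of the two request shapes (the base URL and model name below are placeholders):

    import requests

    base = "http://localhost:8000"  # placeholder vLLM server address

    # /v1/completions: the caller formats the prompt themselves.
    r = requests.post(f"{base}/v1/completions",
                      json={"model": "my-model", "prompt": "Hello", "max_tokens": 32})
    print(r.json()["choices"][0]["text"])

    # /v1/chat/completions: the server applies the model's chat template.
    r = requests.post(f"{base}/v1/chat/completions",
                      json={"model": "my-model",
                            "messages": [{"role": "user", "content": "Hello"}],
                            "max_tokens": 32})
    print(r.json()["choices"][0]["message"]["content"])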

The following is a naive approach to adding chat completion support; I edited the _generate method of HFClientVLLM:

    import requests

    from dsp.modules.cache_utils import CacheMemory  # already imported in dsp/modules/hf_client.py

    def _generate(self, prompt, **kwargs):
        kwargs = {**self.kwargs, **kwargs}

        # Pop the client-only options so they are not forwarded to the server,
        # which would reject unknown request fields.
        model_type = kwargs.pop("model_type", None)
        system_prompt = kwargs.pop("system_prompt", None)

        # Round-robin over the configured server URLs.
        url = self.urls.pop(0)
        self.urls.append(url)

        if model_type == "chat":
            messages = [{"role": "user", "content": prompt}]
            if system_prompt:
                messages.insert(0, {"role": "system", "content": system_prompt})
            payload = {
                "model": self.kwargs["model"],
                "messages": messages,
                **kwargs,
            }
            response = send_hfvllm_chat_request_v00(
                f"{url}/v1/chat/completions",
                json=payload,
                headers=self.headers,
            )

            try:
                json_response = response.json()
                completions = json_response["choices"]
                # Chat completions return the text under choice["message"]["content"].
                return {
                    "prompt": prompt,
                    "choices": [{"text": c["message"]["content"]} for c in completions],
                }
            except Exception:
                print("Failed to parse JSON response:", response.text)
                raise Exception("Received invalid JSON response from server")
        else:
            payload = {
                "model": self.kwargs["model"],
                "prompt": prompt,
                **kwargs,
            }
            response = send_hfvllm_request_v00(
                f"{url}/v1/completions",
                json=payload,
                headers=self.headers,
            )

            try:
                json_response = response.json()
                completions = json_response["choices"]
                # Plain completions return the text under choice["text"].
                return {
                    "prompt": prompt,
                    "choices": [{"text": c["text"]} for c in completions],
                }
            except Exception:
                print("Failed to parse JSON response:", response.text)
                raise Exception("Received invalid JSON response from server")


    @CacheMemory.cache
    def send_hfvllm_request_v00(arg, **kwargs):
        return requests.post(arg, **kwargs)


    @CacheMemory.cache
    def send_hfvllm_chat_request_v00(arg, **kwargs):
        return requests.post(arg, **kwargs)
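
With a change along these lines, switching to the chat endpoint would just be a matter of passing the new flag through the client's kwargs. A hypothetical usage sketch (assuming HFClientVLLM stores extra constructor kwargs in self.kwargs, so that the model_type and system_prompt options introduced above reach _generate):

    import dspy

    lm = dspy.HFClientVLLM(model="meta-llama/Llama-2-7b-chat-hf",
                           port=8000, url="http://localhost",
                           model_type="chat",
                           system_prompt="You are a helpful assistant.")
    dspy.settings.configure(lm=lm)

Caching the chat POSTs through a separate CacheMemory-decorated helper mirrors how the existing completions helper is cached, so repeated identical requests would still hit the cache.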
