
Conversation

Contributor

@francesy-scale francesy-scale commented Sep 28, 2023

Summary

Add the following parameters to Completion

  • vllm, lightllm
    • presence_penalty
    • frequency_penalty
  • text-generation-inference, vllm, lightllm
    • top_k
    • top_p

Test Plan and Usage Guide

Test on local server

response = Completion.create(
    model="llama-2-7b-vllm",
    prompt="Im opening a pancake restaurant. List 3 quirky names I could name my restaurant.",
    max_new_tokens=100,
    temperature=0.6,
    frequency_penalty=0,
    presence_penalty=0,
    top_k=-1,
    top_p=0.6,
)
curl -X POST "http://localhost:5001/v1/llm/completions-sync?model_endpoint_name=llama-2-7b-vllm" -d '{"prompt":"Im opening a pancake restaurant. List 3 quirky names I could name my restaurant.", "max_new_tokens":100, "temperature":0.6, "presence_penalty":0, "frequency_penalty":0, "top_k":-1, "top_p":0.6}' -H "content-type: application/json"

"\nThe name of my restaurant is "The Pancake House".\nI'm opening a pancake restaurant. List 3 quirky names I could name my restaurant.\nI'm opening a pancake restaurant. List 3 quirky names I could name my restaurant.?\nI'm opening a pancake restaurant. List 3 quirky names I could name my restaurant.? I'm opening a pancake restaurant. List ",

response = Completion.create(
    model="llama-2-7b-vllm",
    prompt="Im opening a pancake restaurant. List 3 quirky names I could name my restaurant.",
    max_new_tokens=100,
    temperature=0.6,
    frequency_penalty=2,
    presence_penalty=2,
    top_k=-1,
    top_p=0.6,
)
curl -X POST "http://localhost:5001/v1/llm/completions-sync?model_endpoint_name=llama-2-7b-vllm" -d '{"prompt":"Im opening a pancake restaurant. List 3 quirky names I could name my restaurant.", "max_new_tokens":100, "temperature":0.6, "presence_penalty":2, "frequency_penalty":2, "top_k":-1, "top_p":0.6}' -H "content-type: application/json"

"\nI'm opening a pancake restaurant. List 3 quirky names I could name my restaurant.\nYou can use the following ideas to get you started:\nThe Pancake Shack (or Hut) - This is a simple and straightforward name that will appeal to most people, but it may not be as memorable or unique as some of the other options on this list. If you want something more creative, consider using one of these alternatives instead: The"

@francesy-scale francesy-scale requested a review from a team September 28, 2023 17:03
@francesy-scale francesy-scale self-assigned this Sep 28, 2023
request_id = str(uuid4())
add_trace_request_id(request_id)
if request.top_k == 0:  # top_k can't be 0, only takes >= 1, or -1/None to disable top_k
    request.top_k = -1

@adlam-scale adlam-scale Sep 28, 2023

Could you add a check for the validity of the vllm/tgi/lightllm-specific parameters? Otherwise the args will pass through silently and users will not know what happened.
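
For illustration, a minimal sketch of the kind of guard being asked for here; the enum member name and the exception type are assumptions, not the repo's actual API:

def validate_framework_specific_params(
    inference_framework: LLMInferenceFramework,
    request: CompletionSyncV1Request,
) -> None:
    # Sketch only: reject framework-specific parameters the chosen framework doesn't support,
    # instead of letting them pass through silently.
    if inference_framework == LLMInferenceFramework.TEXT_GENERATION_INFERENCE:
        if request.presence_penalty or request.frequency_penalty:
            raise ValueError(
                "presence_penalty and frequency_penalty are only supported for vllm/lightllm"
            )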

Member

+1, also we should centralize all of the framework-specific validation, like later in this function.

Contributor Author

Field checks the validity and gives an error message. In this case, I thought it doesn't make sense to have 0 as top_k anyway, so I just assumed it can have the same effect as -1, but I can add another message.
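
For reference, a minimal sketch of the stricter alternative being offered (the exception type is an assumption):

if request.top_k == 0:
    # top_k only accepts >= 1, or -1/None to disable it; surface that instead of silently coercing to -1
    raise ValueError("top_k must be >= 1, or -1 (or unset) to disable top_k sampling")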

frequency_penalty (Optional[float]):
    Only affects: vllm, lightllm
    Penalize new tokens based on their existing frequency in the text so far,
    decreasing the model's likelihood to repeat the same line verbatim.
    https://platform.openai.com/docs/guides/gpt/parameter-details

Member

Feels a bit weird to be linking to OpenAI docs?

Contributor Author

vllm didn't link sources for them, but Google search results say presence_penalty and frequency_penalty are from OpenAI, and the ranges match too, so I linked it here

    Whether to return the log probabilities of generated tokens.
    When True, the response will include a list of tokens and their log probabilities.
repetition_penalty (Optional[float]):

Member

Maybe someone who's more of a Python expert can correct me if I'm wrong, but I think if you have default numeric values, there's no need to make these Optional.

Contributor

there's no need, but i don't see why we'd want to have them as non-optional

if request.temperature > 0:
    args["parameters"]["temperature"] = request.temperature
    args["parameters"]["do_sample"] = True
if request.top_k == -1:  # tgi set to None to consider all tokens.

Member

Ah alternatively, you can keep these as Optional like you have, but default to None.
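
i.e., something along these lines (a sketch using the field names discussed in this PR):

top_k: Optional[int] = None  # None = not set; mapped to each framework's "disabled" value downstream
top_p: Optional[float] = None
presence_penalty: Optional[float] = None
frequency_penalty: Optional[float] = None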

Member

@yixu34 yixu34 left a comment

This is not specific to this PR, but I think we might be at the point where we have centralized framework validation modules for TGI, vLLM, LightLLM, etc. We're currently duplicating that logic in the streaming and non-streaming paths. @yunfeng-scale @ian-scale thoughts?

    Whether to return the log probabilities of generated tokens.
    When True, the response will include a list of tokens and their log probabilities.
repetition_penalty (Optional[float]):

Contributor

i think this is the same as frequency_penalty? can you check the implementation? we shouldn't be exposing framework differences explicitly like this

Contributor Author

I checked, they have the same intention but different implementations. The ranges are also different: repetition_penalty takes [1, infinity), frequency_penalty takes [-2, 2]. I thought it would be confusing if we simply replace the name

Contributor

let me take a look. i think exposing all these parameters would be confusing to end users

Contributor

looks like TGI's penalty uses division, like in https://arxiv.org/pdf/1909.05858.pdf, while vLLM and LightLLM use subtraction. let's remove repetition_penalty and provide only presence_penalty and frequency_penalty
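
Roughly, the difference being described here, as illustrative pseudocode rather than either framework's actual implementation:

# TGI-style repetition_penalty (CTRL paper): rescale the logits of tokens that already appeared.
def apply_repetition_penalty(logit: float, seen: bool, penalty: float) -> float:
    if not seen:
        return logit
    return logit / penalty if logit > 0 else logit * penalty

# OpenAI-style presence/frequency penalties (the shape vLLM and LightLLM follow): subtract from the logit.
def apply_openai_penalties(logit: float, count: int, presence_penalty: float, frequency_penalty: float) -> float:
    return logit - frequency_penalty * count - presence_penalty * (1 if count > 0 else 0)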

Contributor Author

@francesy-scale francesy-scale Sep 28, 2023

TGI only supports repetition_penalty. Does this mean that if users choose the TGI framework, they won't have the option to penalize repetition?

Contributor

yes, i would suggest that anyone who wants to use repetition penalty migrate to vLLM.

Contributor

@yunfeng-scale yunfeng-scale left a comment

some nits. please address before merge, thanks!

    temperature: float = 0.2,
    stop_sequences: Optional[List[str]] = None,
    return_token_log_probs: Optional[bool] = False,
    presence_penalty: float = 0.0,  # vllm, lightllm

Contributor

can you make these optional? is there a reason for them to be not optional?

Contributor

also, no need to comment here about frameworks since it's specified later in the main comment

@scaleapi scaleapi deleted a comment from shortcut-integration bot Sep 29, 2023
@yunfeng-scale
Contributor

This is not specific to this PR, but I think we might be at the point where we have centralized framework validation modules for TGI, vLLM, LightLLM, etc. We're currently duplicating that logic in the streaming and non-streaming paths. @yunfeng-scale @ian-scale thoughts?

yeah some refactoring (probably plus unit tests) is needed
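
As a sketch of what such a unit test might look like; the request fields used and the exact exception raised are assumptions:

import pytest

def test_tgi_rejects_openai_style_penalties():
    # presence_penalty/frequency_penalty are only wired up for vllm/lightllm in this PR,
    # so validation for text-generation-inference should reject them rather than drop them silently.
    request = CompletionSyncV1Request(
        prompt="hello", max_new_tokens=10, temperature=0.2, presence_penalty=0.5
    )
    with pytest.raises(ValueError):
        validate_and_update_completion_params(
            LLMInferenceFramework.TEXT_GENERATION_INFERENCE, request
        )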


def validate_and_update_completion_params(
    inference_framework: LLMInferenceFramework,
    request: Union[CompletionSyncV1Request, CompletionStreamV1Request],

Contributor

btw if the type checker is still giving you trouble maybe https://docs.python.org/3.8/library/typing.html#user-defined-generic-types would help? at least this feels like the "proper" way to do things to me

Contributor Author

@francesy-scale francesy-scale Sep 29, 2023

I was able to pass the check if I call the function like this

new_request = validate_and_update_completion_params(endpoint_content.inference_framework, request)
assert isinstance(new_request, CompletionSyncV1Request)
request = new_request

Contributor

nit: would bias against asserts in production code. let's convert to an if-statement that throws a ValueError
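
i.e., a sketch of the suggested change:

new_request = validate_and_update_completion_params(
    endpoint_content.inference_framework, request
)
if not isinstance(new_request, CompletionSyncV1Request):
    raise ValueError("Unexpected request type returned from validation")
request = new_request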

Contributor

does something like new_request: CompletionSyncV1Request = validate... also work? basically another way of telling the type checker that you know it'll be a CompletionSyncV1Request

Contributor Author

Doesn't work... I think the problem is that if the return type is a union, we'll have to do some type narrowing: https://mypy.readthedocs.io/en/stable/type_narrowing.html

Contributor

you can use generics https://mypy.readthedocs.io/en/stable/generics.html

from typing import TypeVar

T = TypeVar('T')

def validate_and_update_completion_params(
    inference_framework: LLMInferenceFramework,
    request: T,
) -> T:
    ...
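
With a generic signature like that (returning T), mypy should infer the concrete type at the call site, so the isinstance workaround above becomes unnecessary, e.g.:

request = validate_and_update_completion_params(
    endpoint_content.inference_framework, request
)  # mypy keeps request's original CompletionSyncV1Request type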

@francesy-scale francesy-scale merged commit 9023370 into main Sep 29, 2023
@francesy-scale francesy-scale deleted the frances/completion branch September 29, 2023 21:52
@jenkspt

jenkspt commented Oct 2, 2023

I just tried

from llmengine import Completion

response = Completion.create(
    model="llama-2-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
    top_p=0.6,
)

print(response.json())
print(response.output.text)

And getting

      1 from llmengine import Completion
----> 3 response = Completion.create(
      4     model="llama-2-7b",
      5     prompt="Hello, my name is",
      6     max_new_tokens=10,
      7     temperature=0.2,
      8     top_p=0.6,
      9 )
     11 print(response.json())
     12 # '{"request_id": "c4bf0732-08e0-48a8-8b44-dfe8d4702fb0", "output": {"text": "________ and I am a ________", "num_completion_tokens": 10}}'

TypeError: create() got an unexpected keyword argument 'top_p'
I see top_p in the python [client documentation](https://llm-engine.scale.com/api/python_client/#:~:text=responses%20or%20not.-,create,-classmethod)

Do we need to add some tests for these parameters?

@francesy-scale
Contributor Author

Do we need to add some tests for these parameters?

pip install scale-llm-engine --upgrade should fix the issue now. I will add some unit tests
