
Feature/grpc streaming #2186

Merged
10 commits merged into master on Mar 28, 2023

Conversation

lxning
Collaborator

@lxning lxning commented Mar 20, 2023

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

This PR supports the new feature of gRPC server-side streaming. It includes:

  • New gRPC API StreamPredictions (a client sketch follows below)
  • gRPC client and server updates
  • TS frontend worker-thread communication with the backend for continuous responses
  • Backend support for sending intermediate responses
  • Regression test test_inference_stream_apis

Fixes #2180
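
For context on how the new API is consumed, here is a minimal client sketch. It assumes stubs generated from TorchServe's inference.proto (inference_pb2 / inference_pb2_grpc) and the default localhost:7070 inference port; the helper name mirrors ts_scripts/torchserve_grpc_client.py but is illustrative, not the exact code in this PR:

```python
import grpc

import inference_pb2       # generated from inference.proto (assumed available)
import inference_pb2_grpc  # generated gRPC stubs (assumed available)


def infer_stream(model_name, model_input_path):
    # Connect to TorchServe's gRPC inference endpoint (default port assumed).
    channel = grpc.insecure_channel("localhost:7070")
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    with open(model_input_path, "rb") as f:
        data = f.read()

    # StreamPredictions returns a stream of prediction responses over one gRPC
    # call, instead of the single unary response returned by Predictions.
    responses = stub.StreamPredictions(
        inference_pb2.PredictionsRequest(model_name=model_name, input={"data": data})
    )
    for resp in responses:
        print(resp.prediction.decode("utf-8"))
```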

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Checklist:

  • Did you have fun?

  • Have you added tests that prove your fix is effective or that this feature works?

  • Has code been commented, particularly in hard-to-understand areas?

  • Have you made corresponding changes to the documentation?

@lxning lxning requested review from vdantu and msaroufim March 20, 2023 17:42
@lxning lxning self-assigned this Mar 20, 2023
@lxning lxning added the enhancement New feature or request label Mar 20, 2023
@lxning lxning added this to the v0.8.0 milestone Mar 20, 2023
@codecov

codecov bot commented Mar 20, 2023

Codecov Report

Merging #2186 (0e3e627) into master (41a3af3) will decrease coverage by 0.13%.
The diff coverage is 38.46%.

❗ Current head 0e3e627 differs from pull request most recent head 8132f23. Consider uploading reports for the commit 8132f23 to get more accurate results

@@            Coverage Diff             @@
##           master    #2186      +/-   ##
==========================================
- Coverage   71.45%   71.32%   -0.13%     
==========================================
  Files          73       73              
  Lines        3296     3306      +10     
  Branches       57       57              
==========================================
+ Hits         2355     2358       +3     
- Misses        941      948       +7     
Impacted Files                        Coverage Δ
ts/protocol/otf_message_handler.py    72.58% <25.00%> (-2.15%) ⬇️
ts/model_service_worker.py            65.89% <50.00%> (ø)
ts/service.py                         77.46% <50.00%> (-0.80%) ⬇️
ts/context.py                         67.10% <100.00%> (+0.43%) ⬆️


Member

@msaroufim msaroufim left a comment


Left a bunch of minor feedback, but I'm not sure I have enough context on what this PR is trying to do to give system-level feedback.

docs/grpc_api.md Outdated
@@ -70,3 +71,28 @@ python ts_scripts/torchserve_grpc_client.py infer densenet161 examples/image_cla
```bash
python ts_scripts/torchserve_grpc_client.py unregister densenet161
```
## GRPC Server Side Streaming
TorchServe GRPC APIs adds a server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This new API is only recommended for the use case when the inference full response latency is high, and the inference intermediate results are sent to client. This new API automatically forces the batchSize to be one.
Member

n00b q: what does intermediate response mean? I initially understood this feature as sending partial batches back, so what's the scenario in which this feature would be useful? Or is it internal only to the large-model work?

Collaborator Author

@msaroufim The TS backend message protocol does not allow sending a partial batch (e.g., with batchSize=10, sending only 5 results) to the frontend (see code).

This feature targets use cases such as generative AI, where the latency to generate the full result is high. It allows the handler to send partial results back to the client gradually.

if type(data) is list:
    for i in range(3):
        send_intermediate_predict_response(["hello"], context.request_ids, "Intermediate Prediction success", 200, context)
    return ["hello world "]
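
For completeness, a hedged sketch of how this pattern could sit inside a custom handler's entry point; the import path assumes send_intermediate_predict_response is exposed by ts/protocol/otf_message_handler.py (one of the files this PR modifies), and the payload strings are illustrative:

```python
# Sketch of a streaming-capable custom handler (assumptions noted above).
from ts.protocol.otf_message_handler import send_intermediate_predict_response


def handle(data, context):
    if type(data) is list:
        for i in range(3):
            # Push an intermediate chunk to the frontend; the client receives it
            # immediately on the StreamPredictions gRPC stream.
            send_intermediate_predict_response(
                ["hello"], context.request_ids, "Intermediate Prediction success", 200, context
            )
        # The returned value is sent as the final chunk of the stream.
        return ["hello world "]
```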
Member

Should this be an async request?

Collaborator Author

It can be async. A custom handler can decide sync or async based on its real use case.

@@ -56,8 +62,8 @@ public BaseModelRequest getRequest(String threadName, WorkerState state)
}

public void sendResponse(ModelWorkerResponse message) {
boolean jobDone = true;
Member

This variable name is a bit confusing outside of the context of streaming; the job is not done yet. Maybe streamComplete or something of that sort would be clearer.

Collaborator Author

Most use cases are non-streaming; they only require retrieving a single message for a batch of jobs. The variable "jobDone" reflects whether message retrieval is complete for a batch of jobs.

@@ -201,8 +200,9 @@ public void pollBatch(String threadId, long waitTime, Map<String, Job> jobsRepo)
logger.trace("get first job: {}", Objects.requireNonNull(j).getJobId());

jobsRepo.put(j.getJobId(), j);
// describe request job batch size always is 1
if (j.getCmd() == WorkerCommands.DESCRIBE) {
// batch size always is 1 for describe request job and stream prediction request job
Member

Not sure I understood this limitation; why is the batch size 1?

Collaborator Author

For generative AI, it is expensive to process even a single request; latency would be even higher with a batch size > 1.


Not really true. We might want a batch of streams as well; it is up to the client.


I think another issue with batch_size > 1 is that there isn't a way to differentiate which stream chunk belongs to which request. Maybe we could use the requestId in the job to associate each chunk with a request, but that is assigned as a UUID when the frontend receives the request, so the client is still unable to differentiate.

Collaborator Author

@lxning lxning Mar 27, 2023


There are two issues when the batch size is set > 1:

  • It breaks the current protocol between frontend and backend, e.g., some requests' intermediate results succeed while others fail.
  • Latency will most likely be even higher if the batch size is larger than 1.

)
)

print(response.msg)
Member

Are these prints necessary? I'm slightly worried that messages will fill up our CI logs, which are already long enough to make searching frustrating.

Collaborator Author

It is helpful for debugging the regression test failure point (i.e., which model registration fails).

    for resp in responses:
        prediction = resp.prediction.decode("utf-8")
        print(prediction)
except grpc.RpcError as e:
Member

Should we also catch the UnicodeDecodeError?

Collaborator Author

It is not necessary to exit the client for a UTF-8 decode error; only an RPC error is fatal.
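
For illustration only, a sketch of the error-handling split being discussed (helper name and message are assumptions, not code from this PR): treat grpc.RpcError as fatal while tolerating a malformed UTF-8 chunk.

```python
import sys

import grpc


def print_stream(responses):
    """Print streamed predictions; only an RPC failure aborts the client."""
    try:
        for resp in responses:
            try:
                print(resp.prediction.decode("utf-8"))
            except UnicodeDecodeError:
                # Non-fatal: skip a chunk that is not valid UTF-8.
                continue
    except grpc.RpcError as e:
        print(f"StreamPredictions failed: {e}", file=sys.stderr)
        sys.exit(1)
```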

for resp in responses:
    prediction.append(resp.prediction.decode("utf-8"))

return " ".join(prediction)
Member

Will this be a list of partial predictions for a single prediction, or a list of multiple predictions with batch size 1?

Collaborator Author

prediction is a list of partial prediction responses. Joining all of the partial responses together here makes the later comparison with expected values much easier.
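
A hedged sketch of that comparison pattern (the model name, expected string, and stub setup are placeholders, not the actual regression test):

```python
import inference_pb2  # generated from inference.proto (assumed, as in the client sketch above)

# Placeholder: the real expected value depends on the dummy streaming handler.
EXPECTED_STREAM_OUTPUT = "hello hello hello hello world "


def infer_stream(stub, model_name, model_input):
    # Collect every chunk from the server-side stream and join them, so the
    # test can compare one string against the expected output.
    prediction = []
    responses = stub.StreamPredictions(
        inference_pb2.PredictionsRequest(model_name=model_name, input={"data": model_input})
    )
    for resp in responses:
        prediction.append(resp.prediction.decode("utf-8"))
    return " ".join(prediction)


def test_inference_stream_apis(stub):
    assert infer_stream(stub, "echo_stream", b"hello") == EXPECTED_STREAM_OUTPUT
```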

@msaroufim msaroufim self-requested a review March 21, 2023 22:29
docs/grpc_api.md Outdated
@@ -70,3 +71,28 @@ python ts_scripts/torchserve_grpc_client.py infer densenet161 examples/image_cla
```bash
python ts_scripts/torchserve_grpc_client.py unregister densenet161
```
## GRPC Server Side Streaming
TorchServe GRPC APIs adds a server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This new API is only recommended for the use case when the inference full response latency is high, and the inference intermediate results are sent to client. This new API automatically forces the batchSize to be one.
Collaborator

Suggested change
TorchServe GRPC APIs adds a server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This new API is only recommended for the use case when the inference full response latency is high, and the inference intermediate results are sent to client. This new API automatically forces the batchSize to be one.
TorchServe gRPC APIs add server-side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same gRPC stream. This new API is only recommended for the use case when the latency of the full inference response is high and intermediate results are sent to the client. An example could be LLMs for generative applications, where generating "n" tokens can have high latency; in this case the user can receive each generated token as soon as it is ready, until the full response completes. This new API automatically forces the batchSize to be one.

ModelInferenceRequest inferReq = (ModelInferenceRequest) req;
boolean streamNext = true;
while (streamNext) {
    reply = replies.poll(responseTimeout, TimeUnit.SECONDS);

Looks like responseTimeout is the same for streaming and non-streaming. Clients might want different timeouts for the streaming and non-streaming APIs, right?

Collaborator Author

responseTimeout is planned to be moved to model-level config.

@lxning lxning merged commit d0510ba into master Mar 28, 2023
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants