[llava][14/N] Refactor runner prefill() and run_model_step() #4556

larryliu0820 · 2024-08-06T07:25:08Z

Stack from ghstack (oldest at bottom):

This refactoring extracts prefill() and run_model_step() from the
runner so that these APIs become replaceable and easy to plug in and
use.

prefill():
When parallel prefill is enabled, or when the kv cache is not used,
the model can accept a block of more than one token at a time.

When the kv cache is used but parallel prefill is not enabled, we can
only feed in one token at a time.
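
The two prefill cases above can be sketched roughly as follows. This is an illustrative Python sketch, not the runner's actual code: the `model` callable, its `(tokens, start_pos)` signature, and the flag names `use_kv_cache` and `enable_parallel_prefill` are assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical stand-in for the model: takes a block of tokens and the
# position of the first token in that block, and returns the predicted
# next token after the last token in the block.
ModelFn = Callable[[List[int], int], int]


def prefill(
    model: ModelFn,
    prompt_tokens: List[int],
    start_pos: int,
    use_kv_cache: bool,
    enable_parallel_prefill: bool,
) -> int:
    """Feed the prompt to the model and return the first generated token."""
    if not prompt_tokens:
        raise ValueError("prompt must contain at least one token")

    if enable_parallel_prefill or not use_kv_cache:
        # The model can accept the whole block (more than one token) at once.
        return model(prompt_tokens, start_pos)

    # KV cache without parallel prefill: feed one token per call and keep
    # only the prediction produced after the last prompt token.
    next_token = -1
    for offset, token in enumerate(prompt_tokens):
        next_token = model([token], start_pos + offset)
    return next_token
```

Both branches return the same kind of result (the token predicted after the prompt), so a caller does not need to know which path was taken.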

run_model_step():
This function should not update the input. Instead, it should run the
model differently depending on whether the kv cache is enabled, and
return the next token directly. All input updates need to happen in
the generation loop.
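
A minimal Python sketch of this split of responsibilities is shown below. It is a hedged illustration only: the `model` callable, the `generate` helper, and all parameter names are hypothetical, and the sketch assumes prefill has already run, so `context_tokens` holds the prompt plus the token prefill returned.

```python
from typing import Callable, List

# Hypothetical stand-in for the model: (tokens, start_pos) -> next token.
ModelFn = Callable[[List[int], int], int]


def run_model_step(
    model: ModelFn, tokens: List[int], pos: int, use_kv_cache: bool
) -> int:
    """Run the model once and return the next token without mutating `tokens`."""
    if use_kv_cache:
        # Earlier context lives in the cache, so only the latest token is fed,
        # together with its position.
        return model(tokens[-1:], pos)
    # No cache: the model has to see the full sequence again (assumed here to
    # start at position 0).
    return model(tokens, 0)


def generate(
    model: ModelFn,
    context_tokens: List[int],
    max_new_tokens: int,
    use_kv_cache: bool,
) -> List[int]:
    """Generation loop: the only place where the input is updated."""
    tokens = list(context_tokens)  # copy so the caller's list stays untouched
    for _ in range(max_new_tokens):
        pos = len(tokens) - 1
        next_token = run_model_step(model, tokens, pos, use_kv_cache)
        tokens.append(next_token)  # input update happens in the loop
    return tokens[len(context_tokens):]
```

Keeping `run_model_step` free of side effects on its input is what makes it replaceable: a different implementation only has to honor the "tokens in, next token out" contract.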

Differential Revision: D60840327

pytorch-bot · 2024-08-06T07:25:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4556

✅ No Failures

As of commit 4ff3a90 with merge base 92edd04:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: e0f4f74
Pull Request resolved: #4556

ghstack-source-id: 62eb4f6
Pull Request resolved: #4556

ghstack-source-id: b4a1220
Pull Request resolved: #4556

larryliu0820 · 2024-08-06T17:51:39Z