Manage the input_pos internally in text llm runner

### 🚀 The feature, motivation and pitch

Instead of exposing `input_pos` in the [`generate_from_pos()` API](https://github.com/pytorch/executorch/blob/main/extension/llm/runner/text_llm_runner.h#L119), we should redesign the API so that the position is tracked as internal state and the `input_pos` argument goes away.

We should support these features:

1. Generate with an input prompt: uses the current context, creates the response, adds it to the context, and adjusts the start position of the KV cache internally.
2. Add context: hydrates the KV cache when loading historical chat, and adjusts the start position internally so that a later generate call continues from it.
3. Clear context: removes prefilled tokens and resets the start position.

To be more specific,

* Add a private field `pos_` and manage it in all APIs.
* Keep the `generate()` API, but instead of assuming a start pos of 0, use the `pos_` field.
* Add `prefill()` API to be able to take chat history.
* Add `reset()` API to reset `pos_` to 0.
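
To make the proposal concrete, here is a minimal sketch of how the position bookkeeping could look. It is not the actual `TextLLMRunner` code: the class name `SketchTextRunner`, the method signatures, the `pos()` accessor, and the whitespace "tokenizer" are all hypothetical stand-ins, and no model is run. Only the handling of `pos_` across `prefill()`, `generate()`, and `reset()` is meant to be illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the runner. It only tracks the position;
// the real runner would also run the model and fill the KV cache.
class SketchTextRunner {
 public:
  explicit SketchTextRunner(int64_t max_context_len)
      : max_context_len_(max_context_len) {
    assert(max_context_len_ > 0);
  }

  // Adds context (e.g. chat history) and advances pos_.
  // Returns false, leaving pos_ unchanged, if the text does not fit.
  bool prefill(const std::string& text) {
    const int64_t num_tokens = count_tokens(text);
    if (pos_ + num_tokens > max_context_len_) {
      return false;
    }
    pos_ += num_tokens;
    return true;
  }

  // Continues from pos_ instead of assuming a start position of 0.
  // Returns the number of "generated" tokens, or -1 if the prompt
  // does not fit in the remaining context.
  int64_t generate(const std::string& prompt, int64_t max_new_tokens) {
    if (!prefill(prompt)) {
      return -1;
    }
    const int64_t remaining = max_context_len_ - pos_;
    const int64_t produced = std::min(max_new_tokens, remaining);
    pos_ += produced;  // the response becomes part of the context
    return produced;
  }

  // Clears the context: the next call starts again at position 0.
  void reset() { pos_ = 0; }

  // Exposed only so the sketch can be inspected; the proposal keeps
  // the position private.
  int64_t pos() const { return pos_; }

 private:
  // Assumption for illustration: one token per whitespace-separated word.
  static int64_t count_tokens(const std::string& text) {
    std::istringstream in(text);
    std::string word;
    int64_t count = 0;
    while (in >> word) {
      ++count;
    }
    return count;
  }

  int64_t max_context_len_;
  int64_t pos_ = 0;
};
```

With this shape, a caller would `prefill()` the chat history once, call `generate()` repeatedly without passing any position, and call `reset()` to start a new conversation.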

### Alternatives

_No response_

### Additional context

_No response_

### RFC (Optional)

_No response_

Manage the input_pos internally in text llm runner #12887
