This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Swap strategy for infinite output #77

Open
setzer22 opened this issue Mar 26, 2023 · 2 comments
Labels
issue:enhancement New feature or request

Comments

@setzer22
Collaborator

setzer22 commented Mar 26, 2023

As discussed in ggerganov/llama.cpp#71 (comment)

The idea is to achieve a naive implementation of infinite output generation using a strategy that simply clears the context window (optionally keeping the original prompt around) and starts adding new tokens.

This is a hack that doesn't properly leverage the advantages of the attention mechanism: when the context window gets full, the transformer's hidden state holds information about more than just the last 2048 tokens, because that information is indirectly embedded in the outputs of the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, a lot of information about those tokens will still be encoded at position 25 even after tokens 10 and 12 fall outside the context window.

A solution that slides the context window would achieve a gradually "fading" context, instead of one where the transformer forgets about a word entirely the moment its token falls outside of context. I have some reason to suspect systems like ChatGPT rely on a mechanism like this, based on their ability to consistently recall parts of the conversation that occurred well before the token window was exceeded. However, I'm not knowledgeable enough to figure out whether there's a way to actually make this work, given that the positional encoding function used in LLaMA (RoPE) is absolute, not relative.

By doing the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike due to the last few tokens having to be reprocessed. So this is very much non-ideal. However, since llama.cpp has recently implemented this, I feel like we should at least add this naive version too until someone can figure out a real solution.
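To make the proposal concrete, here is a minimal, self-contained sketch of the swap in Rust. All names here (`SwapParams`, `rebuild_context`, `keep_last`) are hypothetical and not part of this crate's API; it only illustrates the token bookkeeping, not the actual model calls.

```rust
/// Naive "swap" strategy sketch: when the context window is full, rebuild it
/// from the original prompt plus the tail of the generated tokens, and let
/// the model reprocess that rebuilt context.
type TokenId = u32;

struct SwapParams {
    /// Maximum number of tokens the model's context can hold (e.g. 2048).
    context_size: usize,
    /// How many of the most recent tokens to carry over after a swap.
    keep_last: usize,
}

/// Produce the token list the context should be re-filled with after a swap.
fn rebuild_context(prompt: &[TokenId], generated: &[TokenId], params: &SwapParams) -> Vec<TokenId> {
    // Keep the original prompt around, as suggested above.
    let mut new_ctx: Vec<TokenId> = prompt.to_vec();

    // Carry over only the last `keep_last` generated tokens; re-evaluating
    // them after the swap is what causes the lag spike mentioned above.
    let tail_start = generated.len().saturating_sub(params.keep_last);
    new_ctx.extend_from_slice(&generated[tail_start..]);

    // The rebuilt context must itself fit back into the window.
    debug_assert!(new_ctx.len() <= params.context_size);
    new_ctx
}

fn main() {
    let params = SwapParams { context_size: 2048, keep_last: 64 };
    let prompt: Vec<TokenId> = vec![1, 15043, 3186]; // example token ids
    let generated: Vec<TokenId> = (0..2045).collect();

    // Once prompt.len() + generated.len() reaches context_size, swap:
    let new_ctx = rebuild_context(&prompt, &generated, &params);
    assert_eq!(new_ctx.len(), prompt.len() + params.keep_last);
    println!("context rebuilt with {} tokens", new_ctx.len());
}
```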

@jon-chuang
Contributor

Yes, llama.cpp implements a "hacky" method like this: when the window becomes full, it keeps the first $n$ "prompt" tokens plus the last $k$ tokens.
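For comparison with the sketch above, that selection can also be expressed on a single token buffer, closer to how this comment describes it. The parameter names `n_keep` and `k` are illustrative assumptions, not llama.cpp's actual variable names.

```rust
/// Keep the first `n_keep` "prompt" tokens plus the last `k` tokens of a
/// full buffer. `n_keep` and `k` are illustrative names, not llama.cpp's own.
fn select_after_overflow(tokens: &[u32], n_keep: usize, k: usize) -> Vec<u32> {
    let head_len = n_keep.min(tokens.len());
    // Start the tail no earlier than the end of the head, so short buffers
    // don't duplicate tokens.
    let tail_start = tokens.len().saturating_sub(k).max(head_len);
    [&tokens[..head_len], &tokens[tail_start..]].concat()
}
```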

@olexiyb
Contributor

olexiyb commented Sep 13, 2023

There is a pull request to solve this; please review #424.

philpax mentioned this issue Nov 12, 2023