
Reserve more eval memory and use ggml scratch buffers #116

Merged: philpax merged 3 commits into rustformers:main from the fix-out-of-context-memory branch on Apr 13, 2023

Conversation

juho-p (Contributor) commented on Apr 6, 2023

This attempts to fix #115

Not the brightest code. Scratch buffer usage is imitated from llama.cpp, and the evaluate-context memory reservation is imitated from there as well (or at least that's the attempt). Note that I didn't actually go through everything in llama.cpp, only the parts related to context memory size, so I might be missing something.

I'm quite sure this also allocates considerably more memory for the 7B and 13B models, even though I never had any issues with them running out of context memory. The scratch buffers take 1 GB, and that's for every InferenceSession.

Maybe it would make sense to use scratch buffers only when inferring the bigger models? Though I think llama.cpp always uses them. Also, if there are multiple sessions, the scratch buffers are not shared between them, but they could be (we would just need to make sure that only one evaluate runs at a time somehow).

Anyway, with these changes I couldn't manage to run out of context memory any more, even with the 65B model.
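For context, the scratch-buffer pattern this imitates from llama.cpp looks roughly like the sketch below, written against the ggml C API of that era. The buffer names, the 256 MiB per-buffer size, and the attention/feed-forward split are illustrative assumptions, not values taken from this PR.

```c
#include <stdint.h>
#include <stdlib.h>

#include "ggml.h"

// Size of each scratch buffer; 256 MiB here is illustrative, not the PR's value.
#define SCRATCH_SIZE (256u * 1024 * 1024)

static uint8_t *scratch0 = NULL;
static uint8_t *scratch1 = NULL;

// Called while building the evaluation graph for one transformer layer.
// Tensors created after ggml_set_scratch() are allocated out of the given
// scratch buffer instead of the context's main memory pool, so the pool no
// longer has to hold every intermediate tensor of every layer at once.
void build_layer(struct ggml_context *ctx /* , layer weights, inputs, ... */) {
    if (!scratch0) scratch0 = malloc(SCRATCH_SIZE);
    if (!scratch1) scratch1 = malloc(SCRATCH_SIZE);

    // Route attention intermediates into scratch buffer 0.
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, SCRATCH_SIZE, scratch0 });
    // ... build the attention ops here ...

    // Route feed-forward intermediates into scratch buffer 1.
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, SCRATCH_SIZE, scratch1 });
    // ... build the feed-forward ops here ...

    // Switch back to the main pool for tensors that must outlive this layer.
    ggml_set_scratch(ctx, (struct ggml_scratch) { 0, 0, NULL });
}
```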

juho-p force-pushed the fix-out-of-context-memory branch 2 times, most recently from 1e02c17 to b905791 on April 6, 2023 at 22:10
Review thread on ggml/src/lib.rs (outdated, resolved)
philpax (Collaborator) commented on Apr 7, 2023

Man, I do not enjoy the amount of magic constants we're accumulating, but aside from the &mut thing I think this is fine. Does llama.cpp always reserve 1 GB of scratch, even for 7B?

Review thread on llama-rs/src/lib.rs (outdated, resolved)
juho-p force-pushed the fix-out-of-context-memory branch 2 times, most recently from 248fc8c to bd5480c on April 8, 2023 at 08:04
juho-p force-pushed the fix-out-of-context-memory branch from bd5480c to d279371 on April 8, 2023 at 19:30
philpax added this to the 0.1 milestone on Apr 10, 2023
jon-chuang (Contributor) commented:

Will this help with "Context window full, stopping inference."?

philpax (Collaborator) commented on Apr 13, 2023

> Will this help with "Context window full, stopping inference."?

No, that's because the model has a fixed context limit of 2048 tokens. We're investigating ways to improve this in #77.

philpax merged commit 5db8b4f into rustformers:main on Apr 13, 2023
Development

Successfully merging this pull request may close these issues:

Runs out of ggml context's memory pool on larger models
4 participants