
Lonnnnnnnnng context load time before generation #34

Closed
generic-username0718 opened this issue Mar 13, 2023 · 7 comments

@generic-username0718

I'm running LLaMA 65B on dual 3090s, and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It seems my CPU is only using a single core and maxing it out at 100%. Is there something it's doing that's heavily serialized? Any way to parallelize the workflow?
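A minimal timing sketch (not from this thread) for separating prompt ingestion ("prefill") from token-by-token generation; it assumes a Hugging Face transformers-style model, and the model name, prompt length, and generation settings are placeholders. A long time to first token with fast streaming afterwards would point at the prefill step, and `torch.get_num_threads()` returning 1 would match the single-core CPU observation.

```python
# Hypothetical timing sketch: measures time to first token (prefill) vs. total
# generation time. Model name and prompt are placeholders, not the repo's code.
import time
import threading
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "decapoda-research/llama-65b-hf"  # placeholder; use your local model path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("torch CPU threads:", torch.get_num_threads())  # 1 would match the single-core observation

prompt = "Some long context. " * 500  # placeholder long prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
start = time.perf_counter()
thread = threading.Thread(
    target=model.generate,
    kwargs=dict(**inputs, max_new_tokens=64, streamer=streamer),
)
thread.start()

first_token_at = None
for _chunk in streamer:
    if first_token_at is None:
        first_token_at = time.perf_counter()
thread.join()
end = time.perf_counter()

print(f"time to first token (prefill): {first_token_at - start:.2f}s")
print(f"total generation time: {end - start:.2f}s")
```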

@qwopqwop200 (Owner)

What code did you run?

@USBhost (Contributor) commented Mar 13, 2023

I would like to confirm this issue as well. It really becomes noticeable when running chat versus normal/notebook mode. Chat with nothing set runs really fast, but once you start adding context etc., start-up speed just takes a nosedive.

4-bit 65B on my A6000.

@plhosk commented Mar 14, 2023

In the case of llama.cpp, when a long prompt is given you can see it echo the provided prompt word by word at a slow rate even before it starts generating anything new, so it's directly evident that larger prompts take longer to get through. I'd guess a similar thing is happening here.
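For illustration only (small placeholder model, not this repo's code path): the prompt can be evaluated in one batched forward pass that fills the KV cache before per-token decoding starts. Either way, the cost of that prefill pass grows with prompt length, which is why longer prompts delay the first generated token.

```python
# Sketch of prefill vs. decode with a KV cache, using a small placeholder model
# so it runs anywhere. The prefill pass covers the whole prompt at once; the
# decode loop then only processes one new token per step against the cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tokenizer("a long prompt " * 100, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over all prompt tokens populates the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(dim=-1)

    # Decode: each new token attends against the cached keys/values, so the
    # per-token cost no longer involves re-running the full prompt.
    generated = [next_id]
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```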

@USBhost (Contributor) commented Mar 20, 2023

So I compared bitsandbytes 8-bit and GPTQ 8-bit, and GPTQ was the only one that had a start delay. Something is causing a delay before anything starts generating.
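One hedged way to narrow this down (placeholder model shown; swap in the bitsandbytes 8-bit and GPTQ models being compared): time the same forward pass several times. If the first call is much slower than the rest, the delay looks like one-time setup work (kernel warm-up, weight unpacking) rather than slow per-token generation.

```python
# Hypothetical check: compare the first forward pass against later ones.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2"  # placeholder; substitute the quantized model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

input_ids = tokenizer("hello " * 200, return_tensors="pt").input_ids.to(device)

timings = []
with torch.no_grad():
    for _ in range(5):
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(input_ids)
        if device == "cuda":
            torch.cuda.synchronize()
        timings.append(time.perf_counter() - t0)

# A much larger first entry than the rest suggests one-time setup cost.
print([f"{t:.3f}s" for t in timings])
```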

@Digitous

It runs pretty well once it starts. Not sure if it's loading something or reading layers before inferencing. It definitely has the quirks of new tech; it might just be a case of "well, that's how it works."

@aljungberg (Contributor)

Probably fixed now, see #30.

@qwopqwop200 (Owner)

I think this issue has been resolved.
