some words get lost when using a remote server #8
@ahyatt need your help here.
I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems to occur only with Ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I saw previously, the cause isn't obvious.
BTW, @werelax, when you say remote server, what precisely do you mean?
Why do you think the problem is on the server side?
@ahyatt remote server = a server with an RTX 4090 running Ollama. I tried two scenarios:

Both scenarios resulted in corrupted generations. I believe the issue isn't with the server, since everything worked fine with the previous version of ellama.
@ahyatt I can create a test stand for you as a Docker container. It will work without Ollama and will quickly reply to any request with the same Ollama-formatted message. I bet the problem is in the code for parsing Ollama replies, but I can't check it right now.
@ahyatt some more info, in case it helps: I logged the responses received from the server while generating. Even for a generation with missing words, the server is sending all the words. My guess is that the words are getting lost somewhere on the client side while parsing the responses.
I figured out the issue: basically, we aren't dealing with partial JSON responses correctly. I have a fix for Ollama, but I need to see whether the same issue can hit my other providers. Once everything is working (should be tonight), I'll check everything in and create a new release.
One issue is that if there was content streamed that was incomplete JSON, we would never parse the incomplete part. Now we make sure to only advance when we successfully parse, and try to be more precise about getting only valid JSON. This does not completely solve the problem, however. The other causes of missing content are currently unknown. This is a partial fix for s-kostyaev/ellama#8.
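The "only advance when we successfully parse" strategy is language-agnostic. Here is a minimal sketch in Python (hypothetical names, not the actual llm code): buffer incoming fragments and consume input only once a complete JSON line has parsed.

```python
import json

class NdjsonStream:
    """Accumulate streamed ndjson fragments; only consume input that has
    parsed successfully, so partial JSON is never thrown away."""

    def __init__(self):
        self.buffer = ""

    def feed(self, fragment):
        """Append a raw network fragment; return any complete messages."""
        self.buffer += fragment
        messages = []
        while "\n" in self.buffer:
            line, rest = self.buffer.split("\n", 1)
            try:
                messages.append(json.loads(line))
            except json.JSONDecodeError:
                break  # not valid JSON yet: keep buffering, do not advance
            self.buffer = rest  # advance only after a successful parse
        return messages
```

A fragment that ends mid-object simply stays in the buffer until the rest of it arrives.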
Well, actually, I may have only figured out one issue: I can still reproduce the problem even after my fix. I also need to do more work on my similar change for OpenAI. For this reason, although I checked in a change, I'm not cutting a release tonight; I'll resume work on this tomorrow.
@ahyatt I've spent some time investigating, and I believe I've found the root of the issue. At its core, the problem is unexpected changes in the position that marks the start of the response data.

**Problem 1**

The initial issue lies in how the end of the HTTP headers is located. The first data received from the server looks like this:

```
Content-Type: application/x-ndjson^M
Date: Tue, 31 Oct 2023 17:00:48 GMT^M
Transfer-Encoding: chunked^M
^M
6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M
```

The start of the body is computed by the expression

```elisp
(or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
    (save-match-data
      (save-excursion
        (goto-char (point-min))
        (search-forward "\n\n" nil t)
        (forward-line)
        (point))))
```

which, on the first call, executes without having the binding of `url-http-end-of-headers` available, so it falls back to searching the buffer for a blank line. Upon the second call, `url-http-end-of-headers` is bound, and the position it yields lies to the left of the one the fallback search computed.
This leftward shift in the start-of-body position means that positions saved during an earlier call no longer point at the same text, so part of the response can be skipped. A potential solution might involve modifying the expression so that the body position is computed the same way on every call.

**Problem 2**

The subsequent problem is more complex. Responses from the server use chunked transfer encoding, so each piece of data arrives prefixed with its size in hexadecimal (the `6a` above), and `url-http` removes these markers from the buffer as it processes them. As a result, the initial set of numbers is stripped away. This leads to everything after the deleted markers shifting to the left, again invalidating any saved positions. If my understanding is correct, the entire streaming framework is affected by this, since it assumes that buffer positions stay put. I'm definitely not an expert (and I'm probably wrong), but I think the only solution to this is to use an external process with curl.
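Regarding Problem 1, the blank-line fallback is also sensitive to line endings. A small Python illustration (hypothetical data, not the actual elisp) of how a search for a bare `\n\n` can miss CRLF-terminated headers like the ones in the capture above:

```python
# A response with CRLF line endings (the ^M characters in the capture).
# Searching for a bare "\n\n" blank line never matches, because the
# header/body boundary is really "\r\n\r\n".
raw = (
    "Content-Type: application/x-ndjson\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"  # end of headers
    '6a\r\n{"response":" I","done":false}\r\n'
)

lf_pos = raw.find("\n\n")        # -1: the naive search fails
crlf_pos = raw.find("\r\n\r\n")  # the real header terminator
body = raw[crlf_pos + 4:]        # body starts with the chunk-size marker
```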
As a side comment, I've also noticed that …
@werelax thank you very much for this investigation. I noticed those numbers too, but didn't realize that they could be causing us to miss things. I think there are some simple things we can try, notably using markers in the response buffer; I'll try those today. About using curl: I believe curl is actually used by the url-http library already, but I haven't investigated the details.

As to why it gets called multiple times: I fixed one cause in the commit above. It's because, as you mention, the streamed request is doing more than just appending. This seems unique to Ollama, and I've changed the code to only pay attention to appends. The interesting thing I noticed last night is that now we're only called once per request, except when we miss a response, when we are called twice. Your problem statement doesn't seem to capture this aspect, but I bet there's some related behavior to what you describe that is causing the same request to be sent twice. Perhaps once before the number is purged, and once after? But then why does it not happen more often?
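For context on those numbers: with `Transfer-Encoding: chunked`, each piece of the body is prefixed with its size in hexadecimal followed by CRLF; the `6a` in the capture is 0x6a = 106 bytes, exactly the length of the JSON line plus its newline. A minimal decoder sketch in Python (hypothetical helper, just to illustrate the wire format):

```python
def dechunk(stream: bytes) -> bytes:
    """Decode an HTTP chunked body: <hex size>\r\n<data>\r\n ... 0\r\n\r\n."""
    body = b""
    pos = 0
    while True:
        eol = stream.index(b"\r\n", pos)
        size = int(stream[pos:eol], 16)  # chunk size in hexadecimal
        if size == 0:                    # terminating zero-length chunk
            break
        start = eol + 2
        body += stream[start:start + size]
        pos = start + size + 2           # skip chunk data and trailing CRLF
    return body
```

Deleting these size markers from a buffer in place, as described above, shifts every later byte to the left, which is exactly why saved buffer positions stop lining up.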
@ahyatt from what I understand, using curl in an external process would avoid these problems, since the client would receive the decoded stream directly instead of a buffer that is being rewritten in place.

As a side note, I don't think `url-http` actually uses curl; as far as I know, it implements HTTP itself.
The previous implementation of OpenAI and Ollama streaming chat had an issue where small bits of text were missing from streaming responses. We switch to a different method of processing responses, in which every response is on its own line, and we keep track of the last message number rather than a buffer position, which can move around subtly as things appear and disappear in the response buffer. This fixes s-kostyaev/ellama#8.
I think you probably are right about using curl. If possible, I'd like to use it via the `plz` library.

I've fixed the issue by basically dealing with things on a line-by-line basis, throwing away non-valid JSON lines, and, instead of keeping the position around, keeping the last parsed message number. This seems to work well for me, but please try the latest commits yourself and see if it solves your problem. If it does, I'll release the fix.
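The line-by-line, message-count approach described here can be sketched in Python (hypothetical names, not the actual llm code): re-parse the whole buffer each time, ignore lines that are not valid JSON objects, and remember only how many messages were already delivered.

```python
import json

class MessageCounter:
    """Track streaming progress by message index rather than buffer
    position, so bytes inserted or deleted earlier in the buffer
    (e.g. purged chunk-size markers) cannot desynchronize parsing."""

    def __init__(self):
        self.delivered = 0  # messages already handed to the caller

    def new_messages(self, whole_buffer: str):
        parsed = []
        for line in whole_buffer.split("\n"):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # chunk markers, blank lines, partial JSON
            if isinstance(obj, dict):  # a bare number like "12" is not a message
                parsed.append(obj)
        fresh = parsed[self.delivered:]
        self.delivered = len(parsed)
        return fresh
```

Even if the `6a`-style markers are later deleted and every position shifts, re-parsing from the start and skipping the first `delivered` messages yields the same result.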
@ahyatt I've done some testing and everything worked fine. Thank you very much for the fix!
Cool, I'm going to make a new release and push it now. Thank you for reporting this and your excellent debugging!
First of all, thanks for this awesome package! I've been using it and enjoying it heavily :)
But I'm having a weird issue since the migration to using llm as a backend. Here is a full report:
Issue: Generation Issues with Remote Server after Migration to LLM Backend
Description:
After migrating to using llm as a backend, I noticed that when I change the host to a remote server, the generation output appears incomplete with some words missing.
Steps to Reproduce:
Observed Behavior:
The generated output on the remote server is missing some words as shown in this example:
Additional Information:
Thank you for your attention to this matter and for the great package you've provided!