
some words get lost when using a remote server #8

Closed
werelax opened this issue Oct 30, 2023 · 18 comments

Comments

werelax commented Oct 30, 2023

First of all, thanks for this awesome package! I've been using it and enjoying it heavily :)

But I'm having a weird issue since the migration to using llm as a backend. Here is a full report:

Issue: Generation Issues with Remote Server after Migration to LLM Backend

Description:
After migrating to using llm as a backend, I noticed that when I change the host to a remote server, the generation output appears incomplete with some words missing.

Steps to Reproduce:

  1. Change the host to a remote server.
  2. Execute the generation.

Observed Behavior:
The generated output on the remote server is missing some words as shown in this example:

[screenshot]

Additional Information:

  • This issue was not observed when generating locally.
  • The generation from the remote server is very fast, approximately 10 times faster than my local generation.
  • Prior to the migration to llm, the same server with identical settings worked flawlessly.

Thank you for your attention to this matter and for the great package you've provided!

@s-kostyaev (Owner)

@ahyatt need your help here.


werelax commented Oct 30, 2023

As a follow-up, I tested ellama locally on the remote server and observed a slightly different behavior. Instead of omitting random words, it consistently misses one or two words at the beginning of the generation:

[screenshot]

However, the generations work perfectly when executed through the ollama CLI client.

[screenshot]


ahyatt commented Oct 30, 2023

I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.


ahyatt commented Oct 30, 2023

BTW, @werelax when you say remote server, what precisely do you mean?

@s-kostyaev (Owner)

> I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.

Why do you think the problem was on the server side?


werelax commented Oct 30, 2023

@ahyatt remote server = a server with an RTX 4090 running ollama.

I tried two scenarios:

  1. On my laptop, I changed the llm-ollama-host in ellama-provider to point to the server's address.
  2. I logged into the server via SSH and ran ellama in the server's Emacs installation.

Both scenarios resulted in corrupted generations. I believe the issue isn't with the server since everything worked fine with the previous version of ellama.
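
For reference, scenario 1 boils down to a configuration roughly like this (slot names from memory, so treat it as a sketch rather than my exact config):

    ;; Point ellama at a remote ollama instance instead of localhost
    ;; (host value and model name are placeholders).
    (setq ellama-provider
          (make-llm-ollama :host "my-gpu-server.example.com"
                           :chat-model "dolphin2.1-mistral"))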

@s-kostyaev (Owner)

@ahyatt I can create a test stand as a docker container for you. It will work without ollama and will quickly reply to any request with the same ollama-formatted message. I bet the problem is in the code for parsing ollama replies, but I can't check it now.


werelax commented Oct 30, 2023

@ahyatt some more info, in case it helps:

I logged the responses received from the server when generating. In this generation:

[screenshot]

The server is sending all the words:

[screenshot]

My guess is that llm-ollama--get-partial-chat-response is missing some of the JSON objects.


ahyatt commented Oct 31, 2023

I figured out the issue - basically we aren't dealing with partial JSON responses correctly. I have a fix for Ollama, but I need to see if the same issue can hit my other providers. Once everything is working (should be tonight), I'll check everything in and create a new release.

ahyatt added a commit to ahyatt/llm that referenced this issue Oct 31, 2023
One issue is that if there was content streamed that was incomplete JSON, we
would never parse the incomplete part.  Now we make sure to only advance when we
successfully parse, and try to be more precise about getting only valid JSON.

This does not completely solve the problem, however.  The other causes of
missing content are currently unknown.

This is a partial fix for s-kostyaev/ellama#8.
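
In outline, the idea in this commit is to try parsing from the saved position and advance only when a complete object parses. Roughly (an illustrative sketch using Emacs' built-in json-parse-buffer, not the actual patch):

    ;; Sketch only: parse one object at POS and advance only on success.
    (defun my/try-parse-from (pos)
      "Parse one JSON object starting at POS; return (OBJECT . NEW-POS) or nil."
      (save-excursion
        (goto-char pos)
        (condition-case nil
            (let ((obj (json-parse-buffer)))   ; point moves past the parsed object
              (cons obj (point)))
          (error nil))))                       ; incomplete JSON: report nothing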

ahyatt commented Oct 31, 2023

Well, actually, I maybe only figured out one issue. I can still reproduce the problem even after my fix, though. I also need to work more on my similar change to Open AI. For this reason, although I checked in a change, I'm not cutting a release tonight, but will resume work on this tomorrow.


werelax commented Oct 31, 2023

@ahyatt I've spent some time investigating, and I believe I've found the root of the issue. At its core, unexpected changes to the response parameter of llm-ollama--get-partial-chat-response are causing the trouble. The code assumes that this parameter only grows by appending new content at its end, but in reality that's not the case. Let's break down the issues:

Problem 1

The initial issue is that ollama terminates its lines with ^M. Because of this, llm-request--content fails to strip the headers on the first call to llm-ollama--get-partial-chat-response. When you log the values of response, the first call looks like this:

Content-Type: application/x-ndjson^M
Date: Tue, 31 Oct 2023 17:00:48 GMT^M
Transfer-Encoding: chunked^M
^M
6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M

This is because the expression in llm-request--content

    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            (search-forward "\n\n" nil t)
            (forward-line)
            (point))))

runs before url-http-end-of-headers is bound, and search-forward doesn't find a match because the blank line after the headers is "\r\n\r\n" rather than "\n\n". The net effect is that only the HTTP status line is removed from the server response, leaving the headers untouched.

On the second call to llm-ollama--get-partial-chat-response, url-http-end-of-headers is bound, so the or expression yields a different result. Consequently, the value of the response parameter shifts "from the left", and llm-ollama-last-position ends up pointing at the wrong place. Here's the response value from the second call:

6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}^M
^M
6e^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}^M

This leftward shift of the response content means llm-ollama-last-position now points into the middle of the second or third JSON object. The fallout is a noticeable gap of a few missing words right after the first word in my ellama generations.

A potential solution might involve modifying the (search-forward) expression in llm-request--content. It doesn't seem to be a major change.
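
For illustration, something along these lines might do it (just a sketch, I haven't tested it against the actual llm code):

    ;; Also accept CRLF-terminated headers, so the header block is stripped
    ;; even before url-http-end-of-headers is bound.
    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            (if (re-search-forward "\r?\n\r?\n" nil t)
                (point)            ; point is now at the start of the body
              (forward-line)       ; no blank line yet: skip the status line
              (point)))))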

Problem 2

The subsequent problem is more complex. Responses from ollama come with Transfer-Encoding: chunked, as shown by the chunk-size numbers (6a, 6d, 6e, ...) interspersed among the JSON lines. That's understandable. However, after processing each chunk, url-http purges these numbers from the buffer. Watching the value of response in llm-ollama--get-partial-chat-response, the modification is evident:

{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.576940246Z","response":" from","done":false}^M
^M
69^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.583854197Z","response":" ","done":false}^M
^M

As a result, the initial set of chunk-size numbers is stripped away. This shifts the text under llm-ollama-last-position, so it skips a few characters, misses the opening curly brace { of the next object and, as a consequence, drops the following generated word altogether.

If my understanding is correct, the entire streaming framework of llm hinges on the after-change-functions hook attached to the buffer returned by url-request. This setup is robust as long as url-request merely appends content to the buffer, but it breaks when the buffer content is altered in other ways, as is the case here.

I'm definitely not an expert (and I'm probably wrong), but I think the only solution is to use an external process running curl or similar to handle the streaming, like the previous version of ellama did.


werelax commented Oct 31, 2023

As a side comment, I've also noticed that llm-ollama--get-partial-chat-response gets called multiple times with the exact same response. I don't understand why.


ahyatt commented Oct 31, 2023

@werelax thank you very much for this investigation - I noticed those numbers too, but didn't realize they could be causing us to miss things. I think there are some simple things we can try, notably using markers in the response buffer. I'll try those today. About using curl: I believe curl is actually used by the url-http library already, but I haven't investigated the details.

As to why it gets called multiple times - I fixed one cause in the commit above. It's because, as you mention, the streamed request is doing more than just appending. This seems unique to ollama. I've changed the code to only pay attention to appends. The interesting thing I noticed last night is that we're now only called once with the same request, except when we miss a response, in which case we are called twice. Your problem statement doesn't seem to capture this aspect, but I bet some behavior related to what you describe is causing the same request to be sent twice. Perhaps once before the chunk-size number is purged and once after? But then why does it not happen more often?
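
To make the marker idea concrete, I'm thinking of something roughly like this (a hypothetical sketch with made-up names, not code from llm):

    (defvar my/stream-marker nil
      "Marker pointing just past the text already handed to the caller.")

    (defun my/new-text (response-buffer)
      "Return the text added to RESPONSE-BUFFER since the last call."
      (with-current-buffer response-buffer
        (unless my/stream-marker
          (setq my/stream-marker (copy-marker (point-min))))
        (prog1 (buffer-substring-no-properties my/stream-marker (point-max))
          (set-marker my/stream-marker (point-max)))))

The point being that a marker is relocated automatically when text before it is inserted or deleted, while a stored integer position goes stale as soon as url-http rewrites earlier parts of the buffer.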


werelax commented Oct 31, 2023

@ahyatt from what I understand, url-http first inserts the raw response from the server into the response buffer, and then url-http-chunked-encoding-after-change-function does a second pass that processes the chunks (and removes the chunk-length numbers). The only way I can see to solve the issue with markers would be to move the marker only after the second pass is done. But the code of url-http-chunked-encoding-after-change-function is not easy to follow (at least not for me). This approach might end up being too complicated and brittle.

IMHO, using (make-process) to run curl with a :filter function to process the output would be a faster and more robust way to handle the requests.
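
Roughly something like this (a hypothetical sketch; the endpoint, model name and function names are only for illustration):

    (require 'json)

    ;; Note: relies on lexical binding so the :filter closure can capture `pending'.
    (defun my/ollama-stream (prompt on-chunk)
      "Stream PROMPT to an ollama server via curl, calling ON-CHUNK per JSON line."
      (let ((pending ""))
        (make-process
         :name "ollama-curl"
         :command (list "curl" "-s" "-N"
                        "http://localhost:11434/api/generate"
                        "-d" (json-encode `(("model" . "dolphin2.1-mistral")
                                            ("prompt" . ,prompt))))
         :filter (lambda (_proc output)
                   ;; Hand each complete JSON line to ON-CHUNK; keep partial lines.
                   (setq pending (concat pending output))
                   (let ((lines (split-string pending "\r?\n")))
                     (setq pending (car (last lines)))
                     (dolist (line (butlast lines))
                       (when (string-prefix-p "{" line)
                         (funcall on-chunk (json-parse-string line)))))))))

The nice property is that the :filter only ever sees newly arrived bytes, so there's no position bookkeeping against a buffer that keeps changing underneath us.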

As a side note, I don't think url-http is using curl. It uses the elisp-native url library. That's probably why all the low-level HTTP processing is done inside a buffer.

ahyatt added a commit to ahyatt/llm that referenced this issue Nov 1, 2023
The previous implementation of Open AI and Ollama streaming chat had an issue
where small bits of text were missing from streaming responses. We switch to a
different method of processing responses where every response is on a line,
keeping track by message number, not position, which can move around subtly
based on things appearing and disappearing in the response buffer.

This fixes s-kostyaev/ellama#8.

ahyatt commented Nov 1, 2023

I think you probably are right about using curl. If possible, I'd like to use it via the plz GNU ELPA library, which should allow me to get rid of llm-request entirely. Whatever I do, this is a bigger change that I'll leave until later.

I've fixed the issue by basically dealing with things on a line-by-line basis, throwing away invalid JSON lines, and keeping track of the last parsed message number instead of the position. This seems to work well for me, but please try the latest commits yourself and see if it solves your problem. If it does, I'll release the fix.
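
In rough outline, the new approach amounts to something like this (a simplified sketch, not the exact code now in llm):

    (defvar my/last-message-count 0
      "Number of JSON messages already handed to the caller.")

    (defun my/new-messages (body)
      "Return the JSON messages in BODY that haven't been delivered yet."
      (let ((messages '()))
        (dolist (line (split-string body "[\r\n]+" t))
          ;; Chunk sizes, blank lines and partial objects simply fail to parse.
          (condition-case nil
              (push (json-parse-string line) messages)
            (error nil)))
        (setq messages (nreverse messages))
        (prog1 (nthcdr my/last-message-count messages)
          (setq my/last-message-count (length messages)))))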


werelax commented Nov 1, 2023

@ahyatt I've done some testing and everything worked fine. Thank you very much for the fix!


ahyatt commented Nov 1, 2023

Cool, I'm going to make a new release and push it now. Thank you for reporting this and your excellent debugging!

@s-kostyaev (Owner)

@werelax thank you for the report and the debugging.
@ahyatt thank you for the fast fix.
