
some words get lost when using a remote server #8

Closed
werelax opened this issue Oct 30, 2023 · 18 comments

Comments

werelax commented Oct 30, 2023

First of all, thanks for this awesome package! I've been using it and enjoying it heavily :)

But I'm having a weird issue since the migration to using llm as a backend. Here is a full report:

Issue: Generation Issues with Remote Server after Migration to LLM Backend

Description:
After migrating to using llm as a backend, I noticed that when I change the host to a remote server, the generation output appears incomplete with some words missing.

Steps to Reproduce:

  1. Change the host to a remote server.
  2. Execute the generation.

Observed Behavior:
The generated output on the remote server is missing some words as shown in this example:

[screenshot]

Additional Information:

  • This issue was not observed when generating locally.
  • The generation from the remote server is very fast, approximately 10 times faster than my local generation.
  • Prior to the migration to llm, the same server with identical settings worked flawlessly.

Thank you for your attention to this matter and for the great package you've provided!

@s-kostyaev (Owner)

@ahyatt need your help here.


werelax commented Oct 30, 2023

As a follow-up, I tested ellama locally on the remote server and observed a slightly different behavior. Instead of omitting random words, it consistently misses one or two words at the beginning of the generation:

[screenshot]

However, the generations work perfectly when executed through the ollama CLI client.

[screenshot]


ahyatt commented Oct 30, 2023

I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.


ahyatt commented Oct 30, 2023

BTW, @werelax when you say remote server, what precisely do you mean?

@s-kostyaev (Owner)

> I've noticed some oddness previously, but when looking at what we actually retrieve, it seems like the problem is on the server side. And the problem seems only with ollama. I'll take a look tonight to see if I can spot what might be going wrong, but from what I previously saw, it doesn't seem very obvious.

Why do you think the problem was on the server side?


werelax commented Oct 30, 2023

@ahyatt remote server = a server with an RTX 4090 running ollama.

I tried two scenarios:

  1. On my laptop, I changed the llm-ollama-host in ellama-provider to point to the server's address.
  2. I logged into the server via SSH and ran ellama in the server's Emacs installation.

Both scenarios resulted in corrupted generations. I believe the issue isn't with the server since everything worked fine with the previous version of ellama.
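
For reference, scenario 1 boils down to a configuration roughly like this (slot names from memory, so treat it as a sketch rather than my exact config):

    ;; Point ellama at a remote ollama instance instead of localhost
    ;; (host value and model name are placeholders).
    (setq ellama-provider
          (make-llm-ollama :host "my-gpu-server.example.com"
                           :chat-model "dolphin2.1-mistral"))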

@s-kostyaev (Owner)

@ahyatt I can create a test stand as a docker container for you. It will work without ollama and will quickly reply to any request with the same ollama-formatted message. I bet the problem is in the code for parsing ollama replies, but I can't check it now.


werelax commented Oct 30, 2023

@ahyatt some more info, in case it helps:

I logged the responses received from the server when generating. In this generation:

[screenshot]

The server is sending all the words:

[screenshot]

My guess is that llm-ollama--get-partial-chat-response is missing some of the JSON objects.


ahyatt commented Oct 31, 2023

I figured out the issue - basically we aren't dealing with partial JSON responses correctly. I have a fix for Ollama, but I need to see if the same issue can hit my other providers. Once everything is working (should be tonight), I'll check everything in and create a new release.

ahyatt added a commit to ahyatt/llm that referenced this issue Oct 31, 2023
One issue is that if there was content streamed that was incomplete JSON, we
would never parse the incomplete part.  Now we make sure to only advance when we
successfully parse, and try to be more precise about getting only valid JSON.

This does not completely solve the problem, however.  The other causes of
missing content are currently unknown.

This is a partial fix for s-kostyaev/ellama#8.
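
In outline, the idea in this commit is to try parsing from the saved position and advance only when a complete object parses. Roughly (an illustrative sketch using Emacs' built-in json-parse-buffer, not the actual patch):

    ;; Sketch only: parse one object at POS and advance only on success.
    (defun my/try-parse-from (pos)
      "Parse one JSON object starting at POS; return (OBJECT . NEW-POS) or nil."
      (save-excursion
        (goto-char pos)
        (condition-case nil
            (let ((obj (json-parse-buffer)))   ; point moves past the parsed object
              (cons obj (point)))
          (error nil))))                       ; incomplete JSON: report nothing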

ahyatt commented Oct 31, 2023

Well, actually, I maybe only figured out one issue. I can still reproduce the problem even after my fix, though. I also need to work more on my similar change to Open AI. For this reason, although I checked in a change, I'm not cutting a release tonight, but will resume work on this tomorrow.


werelax commented Oct 31, 2023

@ahyatt I've spent some time investigating, and I believe I've found the root of the issue. At its core, unexpected changes to the response parameter of llm-ollama--get-partial-chat-response are causing the trouble. The code assumes that this parameter only grows by appending new content at its end, but in reality that's not the case. Let's break down the issues:

Problem 1

The initial issue is that ollama terminates its lines with ^M. Because of this, llm-request--content fails to strip the headers on the first call to llm-ollama--get-partial-chat-response. When you log the values of response, the first call looks like this:

Content-Type: application/x-ndjson^M
Date: Tue, 31 Oct 2023 17:00:48 GMT^M
Transfer-Encoding: chunked^M
^M
6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M

This is because the expression in llm-request--content

    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            (search-forward "\n\n" nil t)
            (forward-line)
            (point))))

runs before url-http-end-of-headers is bound, and search-forward doesn't find a match because the blank line after the headers is "\r\n\r\n" rather than "\n\n". The net effect is that only the HTTP status line is removed from the server response, leaving the headers untouched.

On the second call to llm-ollama--get-partial-chat-response, url-http-end-of-headers is bound, so the or expression yields a different result. Consequently, the value of the response parameter shifts "from the left", and llm-ollama-last-position ends up pointing at the wrong place. Here's the response value from the second call:

6a^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}^M
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}^M
^M
6e^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}^M

This leftward shift of the response content means llm-ollama-last-position now points into the middle of the second or third JSON object. The fallout is a noticeable gap of a few missing words right after the first word in my ellama generations.

A potential solution might involve modifying the (search-forward) expression in llm-request--content. It doesn't seem to be a major change.
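
For illustration, something along these lines might do it (just a sketch, I haven't tested it against the actual llm code):

    ;; Also accept CRLF-terminated headers, so the header block is stripped
    ;; even before url-http-end-of-headers is bound.
    (or (and (boundp 'url-http-end-of-headers) url-http-end-of-headers)
        (save-match-data
          (save-excursion
            (goto-char (point-min))
            (if (re-search-forward "\r?\n\r?\n" nil t)
                (point)            ; point is now at the start of the body
              (forward-line)       ; no blank line yet: skip the status line
              (point)))))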

Problem 2

The subsequent problem is more complex. Responses from ollama come with Transfer-Encoding: chunked, as shown by the chunk-size numbers (6a, 6d, 6e, ...) interspersed among the JSON lines. That's understandable. However, after processing each chunk, url-http purges these numbers from the buffer. Watching the value of response in llm-ollama--get-partial-chat-response, the modification is evident:

{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.556243645Z","response":" I","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.563228047Z","response":" will","done":false}
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.570012846Z","response":" count","done":false}
^M
6d^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.576940246Z","response":" from","done":false}^M
^M
69^M
{"model":"dolphin2.1-mistral","created_at":"2023-10-31T17:00:48.583854197Z","response":" ","done":false}^M
^M

As a result, the initial set of chunk-size numbers is stripped away. This shifts the text under llm-ollama-last-position, so it skips a few characters, misses the opening curly brace { of the next object and, as a consequence, drops the following generated word altogether.

If my understanding is correct, the entire streaming framework of llm hinges on the after-change-functions hook attached to the buffer returned by url-request. This setup is robust as long as url-request merely appends content to the buffer, but it breaks when the buffer content is altered in other ways, as is the case here.

I'm definitely not an expert (and I'm probably wrong), but I think the only solution is to use an external process running curl or similar to handle the streaming, like the previous version of ellama did.


werelax commented Oct 31, 2023

As a side comment, I've also noticed that llm-ollama--get-partial-chat-response gets called multiple times with the exact same response. I don't understand why.


ahyatt commented Oct 31, 2023

@werelax thank you very much for this investigation - I noticed those numbers too, but didn't realize they could be causing us to miss things. I think there are some simple things we can try, notably using markers in the response buffer. I'll try those today. About using curl: I believe curl is actually used by the url-http library already, but I haven't investigated the details.

As to why it gets called multiple times - I fixed one cause in the commit above. It's because, as you mention, the streamed request is doing more than just appending. This seems unique to ollama. I've changed the code to only pay attention to appends. The interesting thing I noticed last night is that we're now only called once with the same request, except when we miss a response, in which case we are called twice. Your problem statement doesn't seem to capture this aspect, but I bet some behavior related to what you describe is causing the same request to be sent twice. Perhaps once before the chunk-size number is purged and once after? But then why does it not happen more often?
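
To make the marker idea concrete, I'm thinking of something roughly like this (a hypothetical sketch with made-up names, not code from llm):

    (defvar my/stream-marker nil
      "Marker pointing just past the text already handed to the caller.")

    (defun my/new-text (response-buffer)
      "Return the text added to RESPONSE-BUFFER since the last call."
      (with-current-buffer response-buffer
        (unless my/stream-marker
          (setq my/stream-marker (copy-marker (point-min))))
        (prog1 (buffer-substring-no-properties my/stream-marker (point-max))
          (set-marker my/stream-marker (point-max)))))

The point being that a marker is relocated automatically when text before it is inserted or deleted, while a stored integer position goes stale as soon as url-http rewrites earlier parts of the buffer.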


werelax commented Oct 31, 2023

@ahyatt from what I understand, url-http first inserts the raw response from the server into the response buffer, and then url-http-chunked-encoding-after-change-function does a second pass that processes the chunks (and removes the chunk-length numbers). The only way I can see to solve the issue with markers would be to move the marker only after the second pass is done. But the code of url-http-chunked-encoding-after-change-function is not easy to follow (at least not for me). This approach might end up being too complicated and brittle.

IMHO, using (make-process) to run curl with a :filter function to process the output would be a faster and more robust way to handle the requests.
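
Roughly something like this (a hypothetical sketch; the endpoint, model name and function names are only for illustration):

    (require 'json)

    ;; Note: relies on lexical binding so the :filter closure can capture `pending'.
    (defun my/ollama-stream (prompt on-chunk)
      "Stream PROMPT to an ollama server via curl, calling ON-CHUNK per JSON line."
      (let ((pending ""))
        (make-process
         :name "ollama-curl"
         :command (list "curl" "-s" "-N"
                        "http://localhost:11434/api/generate"
                        "-d" (json-encode `(("model" . "dolphin2.1-mistral")
                                            ("prompt" . ,prompt))))
         :filter (lambda (_proc output)
                   ;; Hand each complete JSON line to ON-CHUNK; keep partial lines.
                   (setq pending (concat pending output))
                   (let ((lines (split-string pending "\r?\n")))
                     (setq pending (car (last lines)))
                     (dolist (line (butlast lines))
                       (when (string-prefix-p "{" line)
                         (funcall on-chunk (json-parse-string line)))))))))

The nice property is that the :filter only ever sees newly arrived bytes, so there's no position bookkeeping against a buffer that keeps changing underneath us.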

As a side note, I don't think url-http is using curl. It uses the elisp-native url library. That's probably why all the low-level HTTP processing is done inside a buffer.

ahyatt added a commit to ahyatt/llm that referenced this issue Nov 1, 2023
The previous implementation of Open AI and Ollama streaming chat had an issue
where small bits of text were missing from streaming responses. We switch to a
different method of processing responses where every response is on a line,
keeping track by message number, not position, which can move around subtly
based on things appearing and disappearing in the response buffer.

This fixes s-kostyaev/ellama#8.

ahyatt commented Nov 1, 2023

I think you probably are right about using curl. If possible, I'd like to use it via the plz GNU ELPA library, which should allow me to get rid of llm-request entirely. Whatever I do, this is a bigger change that I'll leave until later.

I've fixed the issue by basically dealing with things on a line-by-line basis, throwing away invalid JSON lines, and keeping track of the last parsed message number instead of the position. This seems to work well for me, but please try the latest commits yourself and see if it solves your problem. If it does, I'll release the fix.
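
In rough outline, the new approach amounts to something like this (a simplified sketch, not the exact code now in llm):

    (defvar my/last-message-count 0
      "Number of JSON messages already handed to the caller.")

    (defun my/new-messages (body)
      "Return the JSON messages in BODY that haven't been delivered yet."
      (let ((messages '()))
        (dolist (line (split-string body "[\r\n]+" t))
          ;; Chunk sizes, blank lines and partial objects simply fail to parse.
          (condition-case nil
              (push (json-parse-string line) messages)
            (error nil)))
        (setq messages (nreverse messages))
        (prog1 (nthcdr my/last-message-count messages)
          (setq my/last-message-count (length messages)))))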


werelax commented Nov 1, 2023

@ahyatt I've done some testing and everything worked fine. Thank you very much for the fix!


ahyatt commented Nov 1, 2023

Cool, I'm going to make a new release and push it now. Thank you for reporting this and your excellent debugging!

@s-kostyaev (Owner)

@werelax thank you for the report and the debugging.
@ahyatt thank you for the fast fix.
