JSON parse errors handling streamed responses #44
Hey @rcgtnick, thanks for the report. I'm confused by this. As far as I can tell, you are using your server as some kind of proxy between twinny and the Ollama instance? Are you running a proxy server between Ollama and your API? If so, are you also piping the request through, or taking the input and making an API call without streaming? Many thanks.
I may be wrong about the cause, then. The only thing between Twinny and Ollama is a reverse proxy, which shouldn't be modifying headers or request bodies at all. I captured some network traffic for a request between the proxy and Ollama to see what is happening on the wire. Looks like it's not the chunking: while the last chunk is quite large, it's still just one chunk. However, if I enable developer tools in VSCode and add a console.log, I can see the response object is cut off right before Twinny tries to parse it.

Interestingly, though, a capture from the local Ollama request shows an even larger response, but it's received by Twinny all together and parsed just fine. Any other ideas about what would cause this?
After looking at both captures, all of the JSON that is returned/printed in the captures seems to be parseable, so it's harder for me to understand what could be causing it. If it is as you say and large responses are the issue, I think a fix might be to check that the response is complete before parsing it.
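Whatever the exact check, a common guard for this situation is to buffer incoming text and treat a parse failure as "incomplete, wait for more data". A minimal sketch in TypeScript, with hypothetical names (this is a general pattern, not necessarily what twinny shipped):

```typescript
// Buffer incoming text and only act once it parses as a whole JSON value.
// A parse failure is treated as "incomplete, keep buffering".
let buffer = "";

function handleChunk(chunk: string, onObject: (obj: unknown) => void): void {
  buffer += chunk;
  try {
    const obj = JSON.parse(buffer);
    buffer = ""; // parsed cleanly, reset for the next object
    onObject(obj);
  } catch {
    // Incomplete JSON so far; wait for the next chunk.
  }
}
```

Note that this still breaks if two complete objects land in the buffer together, which is exactly the failure mode reported further down the thread.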
Just to rule out the proxy, I opened a port directly to Ollama and captured a Twinny request from the server side. The results look like the above two captures: complete JSON objects in each chunk, but the last chunk is very large. The same error occurred: a JSON parsing error expecting a comma or closing brace.
And just so it's all here, here's an example of a cut-off response object right before Twinny tries to parse it as JSON:
(That's for a different request than the capture, so the responses won't line up exactly.)
I just added a new version to cut off everything after …
Thanks, I'll give it a try.
@rcgtnick all good?
EDIT: This was on v2.6.18.

Now it looks like multiple chunks are getting processed at the same time. Here is the console.log with the object just before it's passed to JSON.parse:

{"model":"codellama:13b-code","created_at":"2024-01-24T13:40:28.671507727Z","response":"\n","done":false}
{"model":"codellama:13b-code","created_at":"2024-01-24T13:40:28.683629072Z","response":" ","done":false}

And here's the stack trace from the extension upon calling JSON.parse:

Based on packet captures, the server sometimes sends multiple JSON objects in one response, and it looks like the extension is processing them as a single JSON object.
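A quick way to see why that fails: JSON.parse expects exactly one document, so two concatenated objects throw even though each is valid on its own. A small TypeScript illustration (not twinny's code; the payload mimics the log above):

```typescript
const payload =
  '{"model":"codellama:13b-code","response":"\\n","done":false}\n' +
  '{"model":"codellama:13b-code","response":" ","done":false}';

try {
  JSON.parse(payload); // throws: two valid objects are not one valid document
} catch (e) {
  console.log("parse failed:", (e as Error).message);
}

// Splitting on the newline between objects parses each one cleanly:
for (const line of payload.split("\n")) {
  console.log(JSON.parse(line).response);
}
```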
On v2.6.21 I see the large, truncated JSON object and a JSON parsing error, same as before v2.6.18.

How about I give you a server you can send requests to?
I'd be happy to try a server-hosted version in case there is anything further that can be done. However, I'm tempted to revert the flimsy changes introduced in the previous release to remove the context, as it appears to be just a server issue. Feel free to email me at richardmacarthy@protonmail.com
If this issue isn't high priority for you, I get it! I've stuck with this so far because I need a VSCode extension that will use a remote Ollama API, and most extensions I've tried only work with local Ollama. I've done a bunch of testing with past versions, and I've found:

- I can't reproduce the issue with 2.6.14, but I can with 2.6.15 and later versions. The prompts are longer: I have a capture of 2.6.15 sending 25kB to the server and getting a 32kB response back (which triggers the JSON parsing issue).
- In 2.6.21, if I set the Context Length from 300 (the default) down to 20, I do not encounter the issue. I get responses of around 5kB, which seem to be fine. At 100 lines of context I get responses of around 11kB, which do cause the issue.

I'll try using the extension with small values here and see if it's still helpful. I'll send you an email and get you set up with a dev server in case you want to try it out.

EDIT: I found the "Num Predict Fim" setting; it looks like that gets me back smaller responses too, probably more reliably than limiting the context length.
As for the code, I don't know JS, but it looks like maybe this pattern would be better than calling JSON.parse on each chunk as it arrives. In the SO post, the chunks are accumulated as they come in and only parsed once the stream ends.
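A sketch of that accumulate-then-parse pattern in TypeScript, assuming a Node-style response stream (illustrative names, not the SO post's exact code):

```typescript
import type { IncomingMessage } from "node:http";

// Accumulate every chunk and parse once, when the stream ends. This is
// fine for an endpoint that returns one JSON document, but a streaming
// endpoint emits many objects, so the accumulated text is not valid JSON.
function readWholeBody(res: IncomingMessage, onDone: (obj: unknown) => void): void {
  let data = "";
  res.on("data", (chunk: Buffer) => {
    data += chunk.toString(); // no parsing here, just accumulate
  });
  res.on("end", () => {
    onDone(JSON.parse(data)); // throws for a stream of many objects
  });
}
```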
I tried this ^^ and ended up where you did in an earlier attempt: sending all the chunks at once, which is a series of JSON objects but not one valid JSON object. I tried looking for chunk delimiters in the stream instead, and that seems to have done it. I'll put up a PR.
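The delimiter approach, as a minimal TypeScript sketch (hypothetical names; it relies on Ollama emitting one JSON object per newline-terminated line):

```typescript
// Split each incoming chunk on newlines and parse every complete line
// as its own JSON object.
function parseChunkLines(chunk: string, onObject: (obj: unknown) => void): void {
  for (const line of chunk.split("\n")) {
    if (line.trim() === "") continue; // skip blank segments
    onObject(JSON.parse(line));
  }
}
```

This version still assumes every read ends exactly on a newline; a later comment below gets at the case where it doesn't.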
I see, I think that there may be a better way. The suggestion to add the callback in …
I just released version 2.6.23.
I don't see v2.6.23 yet, but I tested on v2.6.22. It's not raising the error any longer, but it seems harder to get it to suggest a completion. Maybe it's not related, but I really have to coax it now.
Version 2.6.23 should be available now.
I think it would be better to look for the delimiters in the HTTP chunked transfer encoding than to try parsing every response as JSON. Trying to parse everything as JSON is less efficient, and it won't work if you get a packet that is the end of one chunk and the start of another. Newlines are built into the protocol and are meant to tell you when a chunk ends, so splitting up the data based on newlines is more correct, more reliable, and more efficient. I don't care if you use my PR or not, but there's a packet trace in the description that shows exactly what I mean.
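Handling the boundary case described here means carrying the partial trailing line over to the next read, so an object split across two chunks is reassembled before parsing. A sketch in TypeScript, with hypothetical names (not necessarily the code in the PR):

```typescript
// Carry any partial trailing line over to the next read, so an object
// split across two chunks is reassembled before it is parsed.
let pending = "";

function onData(chunk: string, onObject: (obj: unknown) => void): void {
  pending += chunk;
  const lines = pending.split("\n");
  pending = lines.pop() ?? ""; // the last segment may be an incomplete line
  for (const line of lines) {
    if (line.trim() === "") continue;
    onObject(JSON.parse(line));
  }
}
```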
Ok, thanks @rcgtnick, I'll take another look soon.
Hey @rcgtnick, I just merged #48, which includes your changes. You're correct that this is a better solution for speed and efficiency. Please let me know if it works and we'll close the issue. Many thanks.
2.6.24 looks good! Glad I could help, and thanks for all your work on Twinny!
Funny thing happened to me while I was writing a small LLM chat app that streamed responses. After a while of chatting with an LLM I started getting JSON parsing errors on my chunks. It took me a minute, but eventually I realized I already knew what the problem was and how to solve it!
Describe the bug
When I run a local Ollama server on my M1 MacBook, Twinny seems to always receive a single HTTP chunk containing a complete JSON object with the query response.
However, when running Ollama on a server with a decent GPU, sometimes the responses are large enough that they get split up into multiple chunks, and Twinny attempts to parse incomplete JSON objects.
I believe there is a bug in Twinny's handling of Ollama's chunked HTTP transfer encoding, because if I inspect the object being passed to JSON.parse in the extension, I see incomplete objects and a stack trace from Twinny. It looks like Twinny is receiving a chunk and trying to parse it, but the chunk is not guaranteed to be valid JSON data.
I don't notice this locally, just when running against a server.
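A minimal repro of that failure mode in TypeScript, independent of Ollama (the reads are made up for illustration):

```typescript
// Simulate one JSON object arriving split across two stream reads.
const reads = ['{"response":"hello', ' world","done":true}'];

for (const chunk of reads) {
  try {
    JSON.parse(chunk); // naive per-chunk parsing, as described above
  } catch (e) {
    console.log("parse failed:", (e as Error).message);
  }
}

// Reassembling the reads before parsing succeeds:
console.log(JSON.parse(reads.join("")));
```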
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Twinny should accept the completion suggestion from the server and display it in the editor.
Actual Behaviour
Twinny attempts to parse an incomplete response, resulting in a JSON parsing stack trace, and no suggestion is shown.