
Checking of long documents fails with LanguageTool Premium API #215

Open
protyposis opened this issue Jan 27, 2023 · 9 comments
Labels
1-bug 🐛 Issue type: Bug report (something isn't working as expected) 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label)

Comments

@protyposis

Describe the bug

Long documents are not checked when the LanguageTool HTTP API is used. Requests fail with HTTP status 413 (Payload Too Large). Checking a (short) selection of the document works as expected, and the full document is also checked correctly when the local LanguageTool instance is used (i.e., when no languageToolHttpServerUri is configured). A "long" document here means one with, e.g., 8,000 words and 50,000 characters.

Steps to reproduce

Open a long document in VSCode and wait forever for language check results (or watch the failure in the LTeX Language Server logs).

Expected behavior

The document is fully checked. If the text is too long, I would expect it to be split into multiple requests that are processed successfully. In the worst case, it should at least show a visible warning to the user instead of failing silently.
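For illustration, the splitting idea above could be sketched as follows. This is a hypothetical illustration, not LTeX code: splitIntoChunks and the limit value are made up, and a real implementation would also need to preserve annotation offsets across chunks.

```kotlin
// Sketch: break text into chunks no longer than maxChars, preferring to cut
// at whitespace so words stay intact. Function name and limit are illustrative.
fun splitIntoChunks(text: String, maxChars: Int): List<String> {
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        var end = minOf(start + maxChars, text.length)
        if (end < text.length) {
            // Search backward for a space inside the window; fall back to a
            // hard cut if the window contains none.
            val lastSpace = text.lastIndexOf(' ', end - 1)
            if (lastSpace > start) end = lastSpace + 1
        }
        chunks.add(text.substring(start, end))
        start = end
    }
    return chunks
}
```

Each chunk could then be sent as its own /v2/check request, at the cost of more requests against any per-minute quota.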

Sample document

Reproduction sample can be generated on https://www.lipsum.com/ by choosing 8000 words.

LTeX configuration

    "ltex.languageToolHttpServerUri": "https://api.languagetoolplus.com/",
    "ltex.languageToolOrg.username": "[removed]",
    "ltex.languageToolOrg.apiKey": "[removed]"

LTeX LS log

Jan 27, 2023 7:19:22 PM org.bsplines.ltexls.server.DocumentChecker logTextToBeChecked
FINE: Checking the following text in language 'en-US' via LanguageTool: "[removed]"... (truncated to 100 characters)
Jan 27, 2023 7:19:23 PM org.bsplines.ltexls.languagetool.LanguageToolHttpInterface checkInternal
SEVERE: LanguageTool failed with HTTP status code 413
Jan 27, 2023 7:19:23 PM org.bsplines.ltexls.server.DocumentChecker checkAnnotatedTextFragment
FINE: Obtained 0 rule matches

Version information

  • Operating system: Windows 11
  • vscode-ltex: 13.1.0
  • ltex-ls: no idea how to figure this out from the VSCode extension
@protyposis protyposis added 1-bug 🐛 Issue type: Bug report (something isn't working as expected) 2-unconfirmed Issue status: Bug that needs to be reproduced (all new bugs have this label) labels Jan 27, 2023
@real-or-random

I see the same issue on Emacs / lsp-ltex-ls (though I can't find a log that confirms the 413 error code).

For me, the limit seems to be around 20,000 characters, and according to https://languagetoolplus.com/http-api/#/default, that would mean my credentials aren't really taking effect... Do you have any hints on how to debug this?

@real-or-random commented Feb 8, 2023

Okay, I did some more checking by enabling logging.

  • When I put a wrong username/API key in the config, I get a 403, and with the correct credentials I see corrections for Premium rules, so the credentials work in general.
  • When I try the API manually with the failing tests (using the web interface https://languagetoolplus.com/http-api/#/default), checking works.
  • In the log I see a lot of these:
FINEST: annotatedTextParts = [TEXT("L"), TEXT("o"), TEXT("r"), TEXT("e"), TEXT("m"), MARKUP(" "), FAKE_CONTENT(" "), TEXT("i"), TEXT("p"), TEXT("s"), TEXT("u"), TEXT("m"), MARKUP(" "), FAKE_CONTENT(" "), TEXT("d"), TEXT("o"), TEXT("l"), TEXT("o"), TEXT("r"), MARKUP(" "), FAKE_CONTENT(" "), TEXT("s"), TEXT("i"), TEXT("t"), MARKUP(" "), ...

Are the texts actually sent like this in the JSON, i.e., split into a separate element for every single character? If so, the request body becomes far larger than the text itself.

Edit: It seems the answer is yes:

private fun convertAnnotatedTextToJson(annotatedText: AnnotatedText): JsonElement {
  val jsonDataAnnotation = JsonArray()
  val parts: List<TextPart> = annotatedText.parts
  var i = 0

  while (i < parts.size) {
    val jsonPart = JsonObject()

    if (parts[i].type == TextPart.Type.TEXT) {
      jsonPart.addProperty("text", parts[i].part)
    } else if (parts[i].type == TextPart.Type.MARKUP) {
      jsonPart.addProperty("markup", parts[i].part)

      if ((i < parts.size - 1) && (parts[i + 1].type == TextPart.Type.FAKE_CONTENT)) {
        i++
        jsonPart.addProperty("interpretAs", parts[i].part)
      }
    } else {
      // should not happen
      i++
      continue
    }

    jsonDataAnnotation.add(jsonPart)
    i++
  }

  return jsonDataAnnotation
}

I'm not sure whether this is the root cause, but the conversion can certainly be changed to produce much shorter JSON.
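For illustration, the shortening could work by coalescing adjacent parts of the same type before serializing, so that per-character runs like TEXT("L"), TEXT("o"), ... collapse into a single element per run. The Part/PartType types below are hypothetical stand-ins for ltex-ls's TextPart (MARKUP/FAKE_CONTENT pairs would need extra care and are left out); this is a sketch of the idea, not the actual patch from #228.

```kotlin
// Hypothetical stand-ins for ltex-ls's TextPart; the real class is not
// reproduced here.
enum class PartType { TEXT, MARKUP }
data class Part(val type: PartType, val content: String)

// Merge runs of adjacent parts that share a type, so "L","o","r","e","m"
// becomes a single "Lorem" element. Fewer JSON objects means a much smaller
// request body for the same annotated text.
fun coalesce(parts: List<Part>): List<Part> {
    val merged = mutableListOf<Part>()
    for (part in parts) {
        val last = merged.lastOrNull()
        if (last != null && last.type == part.type) {
            merged[merged.size - 1] = Part(last.type, last.content + part.content)
        } else {
            merged.add(part)
        }
    }
    return merged
}
```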

@real-or-random

I'm unsure if this is the root cause, but the loop can certainly be optimized to produce shorter JSON.

Okay, this is the root cause... I have some local changes that optimize the JSON output. I can open a PR soon.

@Musta-Pollo

That would be very nice 👍

@real-or-random

See #228, which works well for me locally.

Still, more should be done. We should at least truncate the request at the API's character limit. We could also split the text into multiple requests, but I'm not convinced that would be much better, because then you easily hit the per-minute request limits.
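The truncation fallback mentioned above could look roughly like this: keep whole parts until a character budget is exhausted. truncateParts and maxChars are illustrative names, not LTeX or LanguageTool API.

```kotlin
// Sketch: keep whole text parts until adding the next one would exceed the
// character budget, then stop. The tail of the document simply goes unchecked
// (better than a silent 413 on the whole request).
fun truncateParts(parts: List<String>, maxChars: Int): List<String> {
    val kept = mutableListOf<String>()
    var used = 0
    for (part in parts) {
        if (used + part.length > maxChars) break
        kept.add(part)
        used += part.length
    }
    return kept
}
```

Paired with a user-visible warning when truncation happens, this would at least make the behavior predictable.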


By the way, it's still a good idea to set ltex.checkFrequency to "save" to avoid hitting the API limits. The low limits make Premium much less useful, and not clearly better than the open-source version. I complained about them at https://forum.languagetool.org/t/disappointing-api-limits-for-premium/8728; feel free to join in if this bothers you too.

@intractabilis

I still see this problem in VS Code. Was the extension updated on the VS Code marketplace? Should I install something manually?

@intractabilis

I tried a nightly build from the release section of the GitHub repository. VS Code shows "Starting LTeX..." at the bottom forever. The LTeX Language Server output shows

[Info  - 7:20:58 PM] Starting ltex-ls...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jul 25, 2023 7:21:01 PM org.bsplines.ltexls.server.LtexLanguageServer initialize
INFO: ltex-ls 16.0.1-alpha.1.nightly.2023-07-25 - initializing...
Jul 25, 2023 7:21:01 PM org.bsplines.ltexls.tools.I18n setLocale

I tried changing the Java runtime; that didn't help. How can I get the fix for this problem?

@ritscAlex

I get the same problem with nvim v0.9.1 and ltex-ls v16.0.0 set up via null-ls. However, the same file gets checked correctly in VS Code with vscode-ltex v13.1.0.

@intractabilis

@danielnaber I reached out to LanguageTooler GmbH. I explained to support that the advertised limit of 150,000 characters is misleading, because it counts the characters of the augmented (annotated) text rather than the actual text being checked, and users cannot control the former. Since this is entirely unexpected for any user, it amounts to lying to customers about the product they buy. I suggested applying the limit to the characters of the actual text. Support replied, “This is not intended to be changed,” and ghosted me. They didn't address the misrepresentation at all. I talked to a lawyer in the US, but he said that the company being based in Germany makes it difficult. So, if you are in Germany, you may be able to file a consumer fraud complaint.

Meanwhile, I've done the minimum I could: I have canceled my Premium subscription. The funny part is that they sent me an (automated) email asking me to cancel my cancellation. Among other things, the sales pitch mentioned that “It can also check longer texts with up to 100,000 characters.” I replied that the 100,000-character claim is misleading because it counts an arbitrary amount of augmentation, not the actual text. They didn't answer. Oh well. I switched to Grammarly.
