LanguageTool integration #515

Closed
vkbo opened this issue Dec 8, 2020 · 16 comments
Assignees
Labels
discussion Meta: Feature discussions editor Component: Editor enhancement Request: New feature or improvement potential feature Request: May be considered later

Comments

@vkbo
Owner

vkbo commented Dec 8, 2020

Add a tool that can integrate with a locally run or an online provided instance of the LanguageTool embedded HTTP Server, see: https://dev.languagetool.org/http-server.

Thanks to @kyrsjo for reminding me of this. I recognise it, and have considered it before.

It should only be loosely integrated with novelWriter, and could be part of a larger toolbox of text analysis options that I'm anyway considering writing. Perhaps it would be useful to wrap the LanguageTool server in an independent, simple GUI launcher that can be run externally. The novelWriter side of it should only access it through the HTTP API, which would also allow the user to connect to hosted instances of the API.
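
A minimal sketch of what access through the HTTP API could look like, using only the standard library. It assumes a LanguageTool server answering on the `/v2/check` endpoint at `http://localhost:8081`; the function names and the reduced tuple format are illustrative and not part of novelWriter.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a LanguageTool HTTP server is reachable at this address.
DEFAULT_URL = "http://localhost:8081/v2/check"


def build_check_request(text, language="en-US", url=DEFAULT_URL):
    """Build a form-encoded POST request for the check endpoint."""
    data = urllib.parse.urlencode({"text": text, "language": language})
    return urllib.request.Request(url, data=data.encode("utf-8"), method="POST")


def parse_matches(payload):
    """Reduce the JSON response to (offset, length, message, replacements)."""
    result = []
    for match in payload.get("matches", []):
        replacements = [r["value"] for r in match.get("replacements", [])]
        result.append(
            (match["offset"], match["length"], match.get("message", ""), replacements)
        )
    return result


def check_text(text, language="en-US", url=DEFAULT_URL, timeout=10.0):
    """Send the text to the server and return the parsed matches."""
    request = build_check_request(text, language, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_matches(json.load(response))
```

Pointing `url` at a hosted instance instead of a local one would be the only change needed on the calling side, which is the appeal of keeping the integration this loose.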

@vkbo vkbo added enhancement Request: New feature or improvement potential feature Request: May be considered later discussion Meta: Feature discussions labels Dec 8, 2020
@vkbo vkbo self-assigned this Dec 8, 2020
@vkbo
Owner Author

vkbo commented Dec 8, 2020

There is already a Python wrapper on PyPI that seems to be actively developed: https://pypi.org/project/language-tool-python/

This may be sufficient for the integration. Needs testing.

@alexisargyris

Very interesting possibility

@kyrsjo

kyrsjo commented Dec 9, 2020

Cool! It may be possible to learn from what TeXstudio ( https://www.texstudio.org/ ) is doing -- they integrate LanguageTool, and they can launch it on demand. Seems to work well.

It is a bit heavy to run locally though -- I'm using it with the full English ngram data, which takes ~10 GB of disk space and 6 GB of RAM (although the latter may just be that it expands if there is space, and there is lots of space here).

@TheJackiMonster

TheJackiMonster commented Feb 17, 2021

You can also use https://pypi.org/project/language-check/ which works fine. The Python wrapper essentially forwards its calls to the Java application running locally.

For huge text files (which are typical for novels) I wouldn't recommend using it remotely because it will add huge latency. Even using it locally takes time depending on the project size.

I know this because I have implemented integration for this in Manuskript: olivierkes/manuskript#747

You basically want to preprocess most files of your novel and cache the results from LanguageTool because dynamically requesting files, paragraphs or similar takes too much time.

Edit: I have to correct myself... don't use language-check, use language-tool-python instead. The former is just the abandoned original with some unfixed issues.

@jyhelle
Contributor

jyhelle commented Mar 17, 2021

For French texts I prefer to use Grammalecte (an evolution of LightProof) rather than LanguageTool (they share the same lexical base, as the two developers work in parallel). But since they don't find the same errors, one can use both...

Grammalecte is written in Python and has been integrated into several apps, but only exists in a French flavor. It is integrated into LibreOffice (also OpenOffice, but it was behind on versions last time I used it), and we also have it in Sigil, so I wouldn't complain about having it integrated into nW.

Just to be complete: I noticed there exists pygrammalecte, a Grammalecte wrapper in Python https://pypi.org/project/pygrammalecte/

@vkbo
Owner Author

vkbo commented Mar 24, 2021

> For huge text files (which are typical for novels) I wouldn't recommend using it remotely because it will add huge latency. Even using it locally takes time depending on the project size.

> I know this because I have implemented integration for this in Manuskript: olivierkes/manuskript#747

> You basically want to preprocess most files of your novel and cache the results from LanguageTool because dynamically requesting files, paragraphs or similar takes too much time.

How slow would it be if you ran it on a single file instead of the whole novel? Say a scene file of ~1000 words or a chapter of ~5000 words?

I wouldn't run it real time I think. Maybe as a command button.

Now that the internationalisation of novelWriter is near complete, this is one of the features I'm considering looking into next.

@TheJackiMonster

@vkbo The problem is that LanguageTool uses a server to process each text. Even if you use it locally you have to add in latency from transferring it to the server and receiving its results. So using it remotely will hugely depend on the servers keeping up with potentially multiple people trying to process their texts at the same time, their individual internet connections and bandwidth limitations on both ends. Most of this latency is independent of the actual size of the text because it is an additional overhead.

Sure, if you have very low bandwidth, it is worse to transfer huge text files, but transferring many small pieces instead, with the overhead added for each request, makes it worse still.

If possible you should separate the changed passages in the text (changed sentences) and transfer them all together. Then cache or store the responses locally so you don't have to request as much every time.

Even if you add a command button, users will tend to spam the button if the process takes too long (I think I had about 1~3 seconds in some local tests, so not even remotely). That's why I would recommend processing as little as possible on change (ideally in a second thread, so it doesn't affect input latency while writing). If you add an explicit button which has to be pressed first, users will notice the processing time even more. ^^'
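
The suggestion to send only changed passages together and cache the responses could be sketched roughly as below. This is an illustration under assumptions, not how Manuskript does it: paragraphs are keyed by a hash of their exact text, and `check_batch` is a hypothetical callable that takes a list of paragraphs and returns one result per paragraph, for example by making a single request to the server.

```python
import hashlib


def paragraph_key(paragraph):
    """Return a stable key for a paragraph's exact text."""
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()


def check_changed(paragraphs, cache, check_batch):
    """Check only paragraphs not seen before and return results for all.

    `cache` maps paragraph keys to results and is updated in place.
    `check_batch` is a hypothetical callable: list of paragraphs in,
    list of results (one per paragraph, same order) out.
    """
    pending = {}
    for paragraph in paragraphs:
        key = paragraph_key(paragraph)
        if key not in cache:
            # Identical paragraphs are only sent once.
            pending.setdefault(key, paragraph)
    if pending:
        # One batched call for everything that changed.
        results = check_batch(list(pending.values()))
        for key, result in zip(pending, results):
            cache[key] = result
    return [cache[paragraph_key(p)] for p in paragraphs]
```

Because the key depends only on the text, an edited paragraph simply misses the cache, while untouched paragraphs cost nothing on the next run.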

@TheJackiMonster

> How slow would it be if you ran it on a single file instead of the whole novel? Say a scene file of ~1000 words or a chapter of ~5000 words?

Also, since you asked about specific numbers: I think this size is quite unproblematic locally, but the delay can still be noticeable remotely (it could take about 1 second in some situations, I assume...). However, if you don't want to restrict users to a specific chapter size, I would recommend picking the changes out of the text (which is usually much less) rather than taking this approach.

@vkbo
Owner Author

vkbo commented Mar 24, 2021

novelWriter has an upper limit on file size of 5 MB, which is when the Python/Qt interface starts to get into real issues due to the syntax highlighter running on the GUI thread. There's also a user-defined soft cap that defaults to 800 kB that disables some automated features like full document spell checking to reduce load on the syntax highlighter. If this tool is integrated with the highlighter, but runs on a set of rules processed and cached by an off-GUI thread, and also obeys the soft cap, this may work.

I'll have a look at your manuskript implementation, but I'm not very familiar with that project.

I've tried to keep the editor as lightweight as possible to avoid latency when the user is writing. Python is after all very slow on real time stuff. I am considering placing text analysis in a separate dialog box entirely, with its own highlighter, and have it update the editor's text when completed. A bit like the traditional spell checking dialog tool in office apps. It aligns more with the distraction free philosophy to keep them separate and not clutter the text with all sorts of highlights when you're focused on getting "words to paper".

I don't know. I need to think a bit more about this. Thanks a lot for taking the time to provide feedback and insights.

@jyhelle
Contributor

jyhelle commented Mar 24, 2021

> I've tried to keep the editor as lightweight as possible to avoid latency when the user is writing. Python is after all very slow on real time stuff. I am considering placing text analysis in a separate dialog box entirely, with its own highlighter, and have it update the editor's text when completed.

That's the way I prefer.
The act of writing is very personal, but in my own case I first type "by the kilometer" in no distraction mode, then take some rest before re-reading and performing those spelling/syntax/typography checks. So my typing and editing sessions are distinct and often separated by one night or so. (And the book formatting begins much later, when most of the story has been written, checked and had its first proofreading... I can't understand people spending time to format their initial page while the first chapter isn't yet typed.)
So I am not afraid of that sort of tool being run on its own from a specific dialog and needing some time to run through the full text.

@Ryex
Contributor

Ryex commented Apr 30, 2022

I've been looking into doing some work on this potential feature over the last day or so and it seems entirely feasible especially with the use of language-tool-python.

The work done in olivierkes/manuskript#747 is a great starting point. It uses language-check, so its patterns are mostly applicable to language-tool-python (a fork of language-check). Referencing this work would follow the path of modeling the LanguageTool integration on the current spellcheck in novelWriter, adding new highlighter colors and classes for errors other than misspellings, etc.

There are some important questions to answer in regard to the approach however.

  • How should we go about running LanguageTool?
    • language-tool-python can and does download a local server .jar (~200 MB) and run it if it is not given an alternative configuration, and it does this the first time the class is instantiated. This seems to work quite well, but the download time would cause the editor to unexpectedly hang the first time the feature is used unless this is taken care of during installation. It also requires Java to be installed.
    • From digging into the codepaths, there is no clean way to bypass the check that downloads the local server, and the best we could do is wrap the tool in our own checks, ask the user to let us download the server jar, and provide our own progress bar.
    • Allowing the configuration of a custom external server could be useful for some users, but use of the LanguageTool public API should be discouraged due to the rate limiting and poor performance. However, it may be possible to use the public API if checks are done block by block and cached locally.
    • The LanguageTool server is highly configurable. Determining how much of the configuration to expose to the user should be a discussion (check threads, cache size and TTL, rule IDs to disable, etc.).
  • Should the first step be to add a dialog tool to iterate through matches and offer replacements, or should the effort be made to extend the current spellcheck features to highlight LanguageTool errors in the editor?
  • From reading documentation and code, I'm not sure accounting for the custom dictionary words via LanguageTool is quite as easy, and it may require rule matches to be manually filtered. But there is a configuration value called "newSpellings" which may provide this and needs investigation.
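
If rule matches do have to be filtered manually against the custom dictionary, a rough sketch could look like the following. It assumes each match is a dict with `offset` and `length` keys, as in the LanguageTool JSON response; the function name and the case-insensitive comparison are illustrative choices only.

```python
def filter_custom_words(text, matches, custom_words):
    """Drop matches whose flagged text is in the user's custom word list.

    Assumption: each match is a dict with "offset" and "length" keys that
    index into `text`. The comparison is case-insensitive.
    """
    known = {word.lower() for word in custom_words}
    kept = []
    for match in matches:
        start = match["offset"]
        flagged = text[start:start + match["length"]]
        if flagged.lower() not in known:
            kept.append(match)
    return kept
```

Note that this drops any match covering exactly a custom word, not only spelling matches, so a real implementation would probably also want to look at the rule that produced the match.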

In any case, I'd be interested in doing work on this feature if a semi-clear plan could be made for its implementation.

@vkbo
Owner Author

vkbo commented May 1, 2022

Thanks for offering to help on this @Ryex. With other and more pressing features to implement, this one has been on hold for quite some time as you can see from the time stamps. I am not very familiar with these toolsets in the first place, so there is a bit of a threshold for me to get started on this. I would be very happy for some assistance on this!

I am uncertain how it would be best to implement this in practice. The current spell checker is integrated into the syntax highlighter, and is a major bottleneck on large documents when they are first opened. Since it is not currently possible to run the syntax highlighter off the GUI thread, this becomes even trickier. The highlighter, when only used for actual highlighting though, is very fast. And the updates are done on a line by line basis. That is, every time a line is changed, that line is re-highlighted and re-spellchecked.

As I've indicated, there is a performance issue associated with the initial spell check of a large document when it is opened. For any text analysis integration, finding a suitable way to run the tool in the background, and caching between editing sessions, is probably essential for a smooth experience. We could implement a cache for the regular spell checker first to see if it can be done. Thankfully it is fairly easy to trigger an update on a paragraph change using the QTextEdit or QTextDocument signals for this. There is already a thread pool that is currently only used for the word counter that could also be used for queueing up both spell check tasks and analysis tasks. The highlighter would then just read from a buffer of pre-computed highlight regions.
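
The idea of queueing checks on a thread pool and letting the highlighter read from a buffer can be sketched without Qt. In the sketch below, `concurrent.futures.ThreadPoolExecutor` stands in for novelWriter's existing thread pool, and `HighlightBuffer`, `find_problems` and `queue_check` are hypothetical names; the checker is a trivial stand-in, not a real spell checker.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class HighlightBuffer:
    """Thread-safe store of pre-computed highlight regions per block."""

    def __init__(self):
        self._lock = threading.Lock()
        self._regions = {}

    def store(self, block_id, regions):
        with self._lock:
            self._regions[block_id] = list(regions)

    def regions(self, block_id):
        """Return the regions for a block, or an empty list if none are ready."""
        with self._lock:
            return list(self._regions.get(block_id, []))


def find_problems(text, known_words):
    """Stand-in checker: return (start, length) for each unknown word."""
    regions = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        if word.strip(".,;:!?").lower() not in known_words:
            regions.append((start, len(word)))
    return regions


def queue_check(pool, buffer, block_id, text, known_words):
    """Queue a check on the pool; the result lands in the buffer when done."""
    def task():
        buffer.store(block_id, find_problems(text, known_words))
    return pool.submit(task)


if __name__ == "__main__":
    buffer = HighlightBuffer()
    with ThreadPoolExecutor(max_workers=2) as pool:
        queue_check(pool, buffer, 1, "A smple line.", {"a", "line"}).result()
    print(buffer.regions(1))
```

The point of the buffer is that the highlighter never waits: it formats whatever regions are already stored and is asked to re-highlight the block once a task finishes.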

As for your points specifically:

  • I think the LanguageTool integration should be installed separately and handled by a dialog box so that it is understood as an optional third party tool that novelWriter interacts with. I don't want to create a plugin framework per se, but I would like to have a framework in place that can run user-selected processing tasks. I want to add a simple text cleanup feature as a base case for this (see Add a general text cleanup and checking tool #1045).
  • For your second main point, I'm uncertain on how to integrate such a tool in a user-friendly way. I don't like the idea of an additional dialog box or tool window on top of the editor to do editor tasks. Although the spell checking dialog in LibreOffice and the like has some merit. I think it will depend on how complicated it is. I have some ideas, but I'll get back to this below.
  • I think a custom words feature should be useful. Especially for writers, like myself, who dabble in sci-fi and fantasy where we frequently make up new words. As long as it can be implemented without having to pull in a lot of the advanced logic in these tools anyway.

GUI Design Ideas

Say we want to add these features "on top" of the current editor, so that the text can be dynamically edited in-place. I think perhaps a coloured gutter bar in the text margin could show which paragraph is being analysed. An expandable panel below the editor window, similar to the references panel in the viewer, can hold the needed information for the tool. I already want to add a "Problems" list that can report various errors in your text as an alternative, or addition, to the underlines. Much like the Problems tab in VSCode does.

The LanguageTool analysis interface can occupy another tab in this panel, and have a few real-time settings (if needed) and a prev/next button to iterate through paragraphs, and provide its feedback and proposed changes if it produces any (I haven't played with these tools in a while).

Implementation

  • Add a cache feature for the highlighter. This can be implemented with the spell checker as a test case, and should be designed such that other tools can request highlightings for it. The cache could be a simple Python dictionary that is stored in the cache folder in the project as a JSON file for each text document. This is an idea I've been considering for a while anyway to solve the large file bottleneck that currently exists for spell checking. All misspelled words could be saved in this dictionary, and the wiggly line only applied to those listed during load time. The QTextEdit widget with highlighting enabled can handle huge documents as long as there aren't heavy tools like the spell checker blocking the execution. I could then remove the size restrictions currently in place.
  • Add a panel below the editor to list "Problems" in the document. The info in the problems list should be taken directly from the above mentioned cached dictionary for fast updates on load.
  • With the two above features in place, LanguageTool can be plugged into this with its own tab next to the Problems tab, if it needs such a tab, and its own entries in the Problems list for quick access.
  • A separate tool to enable, download, run and configure LanguageTool needs to be added. I want to add a new dialog anyway for "Text Analysis Settings" where I want to move over the spell checking bits from the main Preferences as well. I also want to add a feature to download new dictionaries from online sources, mostly as a service to Windows users (see Spell check tool and adding dictionaries #982). I also want to add some basic online dictionary and thesaurus integration (see Online lookup from editor #763).
  • I also want to add @jyhelle's request in Readability metrics #712 somewhere in all of this. However, I may add that in the Build Tool and apply it to the compiled manuscript rather than to individual documents. Although both could be done since it could use the same underlying algorithm. These results can also be cached. We already compute the shasum of the text, so detecting changes to the whole text to invalidate cached results is easy.
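
The per-document JSON cache with hash-based invalidation described in the list above might look something like this. It is a sketch under assumptions: the file naming, the payload layout, and the choice of SHA-256 are illustrative, and `doc_handle` is a hypothetical identifier for the text document.

```python
import hashlib
import json
from pathlib import Path


def text_hash(text):
    """Hash of the full document text, used to detect stale cache files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def save_cache(cache_dir, doc_handle, text, problems):
    """Write the problems for one document to <cache_dir>/<doc_handle>.json."""
    path = Path(cache_dir) / f"{doc_handle}.json"
    payload = {"hash": text_hash(text), "problems": problems}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_cache(cache_dir, doc_handle, text):
    """Return cached problems, or None if missing, unreadable, or stale."""
    path = Path(cache_dir) / f"{doc_handle}.json"
    try:
        payload = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    if not isinstance(payload, dict) or payload.get("hash") != text_hash(text):
        return None
    return payload.get("problems")
```

On load, a `None` result would mean the document has to be checked again in the background, while a hit lets the underlines be applied straight away.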

So, in conclusion, there are a bunch of other features that can be tied together to make a more complete toolset for processing text. All of these would benefit from a redesign of the syntax highlighter and the addition of a panel below the editor. It would also create a framework where new features could be added. I do want this to be modular enough that the user can be provided with a selection of options that supports other languages than English.

@Ryex
Contributor

Ryex commented May 1, 2022

I have to admit, when I was looking into it I was a bit put off by the tight integration of the spell check and the highlighter. My first instinct was to separate the two before moving on. I'm glad to see that concern is shared.

I think aiming for the modular approach first is the smart move. While slapping on a LibreOffice-esque dialog would be relatively easy, it would by no means be user friendly or clean.

Proposal

  • Step one in this would be to write up a generic "text problem" interface that all tools would work with.
    Here is an example
    # TextProblem
    {
      "type": "novelWriter.misspelling", # an ID/flag for the type of issue, could be mapped to a highlighter class
      "message": "",  # OPTIONAL message for the issue, i.e. the LanguageTool rule explanation. Not displayed if left empty.
      "offset": 42, # offset from start of document (or block if these issues are stored per block)
      "context": "An exmpl issue that a tool may report", # context of content around the issue, provided by the tool with its report
      "context_offset": 3, # offset from start of provided context
      "length": 5, # length of detected issue
      "replacements": [  # List of possible replacements that would fix the issue
        "example"
      ]
    }
    This seems like the minimum information that would be needed
  • A list (ordered by their offset in the document) of these issues could be stored and cached with a hash of the text to serve as a way of invalidating them. If stored per block (paragraph) the cache would become much more useful.
  • Special care would be needed in cases where the offset length regions overlap.
    • A naive implementation could assume that issues would be resolved one at a time and the cache invalidated and the check on the block rerun each time. This may in fact be the best way as the alternative is the use of a list of interval trees when storing issues and invalidating the tree when an issue is resolved
  • Tools could be run off thread and return a list of issues to this interface. The list of issues could then be used to inform the highlighter and populate a VSCode like panel.
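
To make the proposal above a little more concrete, here is a minimal sketch of the "text problem" record as a dataclass, with helpers for ordering, overlap detection, and applying a replacement. The field names follow the example; everything else (`sort_problems`, `overlapping`, `apply_replacement`, the simple pairwise scan in place of interval trees) is an assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextProblem:
    type: str
    offset: int
    length: int
    message: str = ""
    context: str = ""
    context_offset: int = 0
    replacements: list = field(default_factory=list)

    @property
    def end(self):
        """Offset just past the last character of the flagged region."""
        return self.offset + self.length


def sort_problems(problems):
    """Order problems by where they start in the document (or block)."""
    return sorted(problems, key=lambda p: (p.offset, p.length))


def overlapping(problems):
    """Return pairs of problems whose regions overlap."""
    pairs = []
    ordered = sort_problems(problems)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.offset >= first.end:
                break  # sorted by offset, so nothing later can overlap
            pairs.append((first, second))
    return pairs


def apply_replacement(text, problem, index=0):
    """Return the text with the flagged region replaced by one suggestion."""
    replacement = problem.replacements[index]
    return text[:problem.offset] + replacement + text[problem.end:]
```

With the naive approach described above, applying one replacement would be followed by discarding the remaining problems for that block and rerunning the check, since later offsets shift.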

If this is the path to go, this discussion should probably be split off into another issue. Feel free to ping me, as I would like to work on this. It would make my writing process much smoother.

@vkbo
Owner Author

vkbo commented May 2, 2022

When I first started this project back in 2018, I hadn't written much in Python (I was primarily working in Fortran at the time), and nothing in Qt, so there are a lot of old implementations that are not at all optimal. Some are also related to supporting pre 3.6 Python versions. I am slowly rewriting core parts of the code to be more Pythonic and reflective of newer Python releases. Since 3.6 is also now dropped, it may be a good time to start adding typing info. It makes it easier to collaborate on the code.

The spell checker integration was fine when it was only spell checking, and it was assumed users would only have single documents of maximum a few thousand words. After all, a lookup is made on every single word during highlighting, so it is fairly obvious that this doesn't scale well.

It is possible to store meta data in the editor on a per-block basis. ID-ing a block by its block position is not really viable since this is merely an index in a list, and they will change all the time when the user inserts text, causing the "Problems" dictionary to have to be updated. I'm wondering if instead of storing a single cache dictionary in memory that there should instead be a meta data object stored with the text block itself in the QTextDocument. The data can be piped to a separate file on save, or dumped at the bottom of the file as a serialised JSON. Then we don't have to consider how to associate a text block with its custom meta data. During save, the block ID is frozen, and can be used for the meta data. It will increase the save time slightly, but I doubt it would be noticeable to the user.
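
A Qt-free sketch of the idea: the metadata travels with the block object while editing, and the block position is only used as an ID at save time. `Block` is a hypothetical stand-in for a text block; in Qt, the per-block data could presumably be attached through `QTextBlockUserData`, but that mapping is an assumption and is not shown here.

```python
import json


class Block:
    """Stand-in for a text block that carries its own metadata."""

    def __init__(self, text, problems=None):
        self.text = text
        self.problems = list(problems or [])


def serialise_metadata(blocks):
    """Freeze block positions at save time and dump non-empty metadata."""
    meta = {str(i): b.problems for i, b in enumerate(blocks) if b.problems}
    return json.dumps(meta, sort_keys=True)


def restore_metadata(blocks, dumped):
    """Re-attach saved metadata to freshly loaded blocks by position."""
    for index, problems in json.loads(dumped).items():
        blocks[int(index)].problems = problems
```

Inserting a block before a flagged paragraph needs no bookkeeping during editing, because nothing is keyed by position until the document is saved.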

Overlapping regions is a non-issue from the highlighter's point of view (although it may be visually messy). The highlighter uses character format merging to set the format.

Proceeding:

  • I will make an issue on splitting off the spell checker into a framework like we are discussing. We can discuss the meta data format there to make sure it is suitable for both use cases.
  • I will also add the Problems panel and use it to display spell checking data and formatting information as per Add a general text cleanup and checking tool #1045.
  • I think I also will split off all of the non-GUI code here into a new subpackage for text analysis. At present it is only the spellchecker, and it currently lives in the core subpackage with the rest of the non-GUI code.

Since this is a rather large rewrite of core features, I will set up a milestone for it and pull in the relevant issues. I can also set up a project (kanban) for tasks if you prefer to work this way since we may be splitting tasks.

I already have release 1.7 and 1.8 planned, so this could be suitable for a 1.9 release. The changes here should not interfere much with 1.8 which focuses only on the "Build Project Tool", and most of 1.7 is already in main. 1.7 is mostly a rewrite of the project data structure to lift a lot of restrictions, and solve a few blockers for 1.8.

The timeline here is thus on the order of around half a year, give or take. I do a minor release every few months, depending on how much spare time I have to spend on this. I could make a branch for this already now so it is possible to start without interfering with the 1.8 release.

@vkbo vkbo added this to the Release 1.9 Beta 1 milestone May 3, 2022
@vkbo vkbo added the editor Component: Editor label May 31, 2022
@xahodo

xahodo commented Mar 13, 2023

I'd like to add another perspective: that of iA writer.

iA writer is actually quite a nice program, does markdown (with some creature comfort additions). It's basically a text editor geared to writers of documents. It does its job quite well... but, alas, it doesn't support Linux :(

iA writer doesn't do actual grammar/style checking; instead it just highlights the nouns/verbs/adverbs/adjectives/etc. (whichever you select in a nice drop-down menu), and leaves the grammar and style checking to the user. This allows the application to perform much better, and there's no need to rely on some web service or include a multi-gigabyte program in the download.

LanguageTool integration would be overkill, in my humble opinion.

@vkbo
Owner Author

vkbo commented Jan 31, 2024

I am also thinking this is overkill. I don't think I want to move novelWriter down this route at all, considering the current "AI" hype (i.e. rebranded LLM).

I do plan to add a text analysis framework which I hope to design in a plug-in like manner. It can include anything from counters, to statistics to text analysis implementations and the user can select and run them on single documents or the entire project. I think it is a better approach, and users with backgrounds from different languages can contribute language-specific ones.

@vkbo vkbo closed this as not planned Jan 31, 2024
7 participants