Message generation by the basic LaTeX filter #180

matze-dd · 2021-02-01T13:22:56Z

Currently, the basic filter yalafi provides only the plain text and a map of character positions. To report certain problems as LaTeX-related errors, a special mark has to be inserted into the plain text, which is then detected by the proofreading software.

We should add a third "channel" between yalafi and yalafi.shell with complete messages that can be inserted just as currently for option --single-letters (which is implemented in yalafi.shell). When using the base filter yalafi in isolation, one could save these messages in a JSON file.
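A rough sketch of what such a third channel might look like when the base filter is used in isolation. The `FilterResult` container, the `make_message` helper, and all field names are hypothetical and not part of yalafi; the message layout only imitates the dictionary shape of a LanguageTool match.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FilterResult:
    """Hypothetical container for the three channels of the filter."""
    plain: str      # plain text handed to the proofreader
    charmap: list   # plain-text index -> position in the LaTeX source
    messages: list = field(default_factory=list)  # complete messages


def make_message(offset, length, rule_id, text):
    # Field names are assumptions modelled on a LanguageTool match.
    return {
        'offset': offset,
        'length': length,
        'rule': {'id': rule_id},
        'message': text,
        'replacements': [],
    }


def messages_to_json(result):
    # Serialise only the third channel, e.g. for writing it to a file.
    return json.dumps(result.messages, indent=2)


result = FilterResult(plain='Some text', charmap=list(range(1, 10)))
result.messages.append(make_message(5, 4, 'YY_LATEX_ERROR', 'unknown macro'))
dumped = messages_to_json(result)
```

An application like yalafi.shell could then merge `result.messages` into the proofreader's matches instead of searching the plain text for marks.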

We first need to resolve issue #169.

A first application could be direct injection of LaTeX-related problems (cf. issue #177), then we could shift processing of option --single-letters and --equation-punctuation to core yalafi, and finally we perhaps could generate better messages for problems in displayed equations (for instance, issue #158).

In all cases, we would no longer need to insert special marks (or repeated words) for the proofreading program to find.


torik42 · 2021-02-01T18:00:52Z

This is exactly what I intended in #177 (see my comment there). But I will stick to your separation of the issues. Now to this one.

I do not yet understand how all the code works together. Anyway, here are some thoughts:

Could the parser collect all errors and report them together with the tokens (I guess line 56 in tex2txt.py)?
Could one insert a special error token that contains the error properties (i.e. message, ID, short description, …) but is replaced by an empty string? Later, the tokens can be searched for this error token and its messages appended to the messages from the proofreader.

Although I wrote ‘empty string’ above, I think it would be nice if some kind of mark is written to the output in case it produces extra errors. This could be made optional. Also, it is helpful in plain text output to have these. One could also think about removing all LT errors which report a misspelled LATEXXXERROR to not have duplicate messages.
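The de-duplication idea could be sketched roughly as follows. The function name `drop_marker_matches` is hypothetical, the matches are assumed to be LanguageTool-style dictionaries with `offset` and `length`, and the mark is assumed to appear in full in the plain text (a truncated mark would need a looser test).

```python
def drop_marker_matches(matches, plain, marker='LATEXXXERROR'):
    """Drop proofreader matches whose flagged span contains the error mark."""
    kept = []
    for m in matches:
        flagged = plain[m['offset']:m['offset'] + m['length']]
        if marker in flagged:
            # Already reported as a LaTeX problem, skip the duplicate.
            continue
        kept.append(m)
    return kept


plain = 'A LATEXXXERROR and a tset.'
matches = [
    {'offset': plain.index('LATEXXXERROR'), 'length': len('LATEXXXERROR'),
     'message': 'Possible spelling mistake'},
    {'offset': plain.index('tset'), 'length': len('tset'),
     'message': 'Possible spelling mistake'},
]
kept = drop_marker_matches(matches, plain)
```

Only the genuine spelling match on "tset" survives; the mark itself stays in the plain text, so it can still be shown in plain-text output.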

matze-dd · 2021-02-02T09:01:39Z

Thank you for the thoughts!

> Could the parser collect all errors and report them together with the tokens (I guess line 56 in tex2txt.py)?

Yes, this is roughly the plan. (But tex2txt returns the plain text as string(s).)

> Could one insert a special error token which contain the error properties ...

See first point.

> Also, it is helpful in plain text output to have these.

This would then be done "automatically" by yalafi.shell, once it is ready to read from the "third channel".

> ... This could be made optional. Also, it is helpful in plain text output to have these. One could also think about removing all LT errors which report a misspelled LATEXXXERROR to not have duplicate messages.

Yes, I was already thinking about these points, too. There are some subtle interdependencies to be taken into account.

torik42 · 2021-02-02T12:28:19Z

Thank you for all the answers. I already played a little with this idea yesterday:

> Could one insert a special error token which contain the error properties (i.e. message, ID, short description, …) but replaces to an empty string? Later the tokens can be searched for this error token and the messages appended to the messages from the proofreader.

It should also work pretty well without too many modifications to the code. Here is a quick sketch. Don’t take it too seriously, but it already yields reasonable results. I can put more effort into this next week. Just let me know. In the end you know the code a lot better.

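The mock-up ErrorToken; the offset is filled in later:

class ErrorToken(TextToken):
    # Mock-up: a text token that also carries a LanguageTool-style match.
    # Note: short_msg is accepted but not used yet.
    def __init__(self, pos, txt, rule_id, short_msg, msg, pos_fix=True):
        # Pass pos_fix on, so that the whole mark maps to one source position.
        super().__init__(pos, txt, pos_fix=pos_fix)
        self.error = {
            'offset': 0,    # placeholder, set in get_txt_pos_err()
            'length': 1,
            'context': {
                'text': 'sometext',     # dummy context
                'offset': 0,
                'length': 1,
            },
            'rule': {'id': rule_id, 'category': {'name': 'YY_LATEX_ERROR'}},
            'message': msg,
            'replacements': [],
        }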

In tex2txt, I replace txt, pos = utils.get_txt_pos(toks) with txt, pos, err = utils.get_txt_pos_err(toks) (line 60), where the new function is defined by

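def get_txt_pos_err(toks):
    # Like get_txt_pos(), but also collects the errors carried by ErrorTokens.
    txt = ''
    pos = []
    errors = []
    for t in toks:
        txt += t.txt
        if isinstance(t, defs.ErrorToken):
            error = t.error
            # pos has not been extended for this token yet, so len(pos)
            # is the plain-text index where the token's text starts.
            error['offset'] = len(pos)
            errors.append(error)
        if t.pos_fix:
            # every character maps to the same source position
            pos += [t.pos] * len(t.txt)
        else:
            pos += list(range(t.pos, t.pos + len(t.txt)))
    return txt, pos, errors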

I also return the errors: return txt, pos, err (line 68).
In proofreader.py, I pick them up with plain, charmap, err = tex2txt.tex2txt(tex, t2t_options) (line 82) and later append them to the matches: matches += err.
In the definition of latex_error(err, pos, latex, parms), I replace the token defs.TextToken(pos, mark[:mx], pos_fix=True) with defs.ErrorToken(pos, mark[:mx], 'YY_ERROR', '', err), so that an error token is created for every error.
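To make the flow concrete, here is a self-contained toy version of the scheme. The classes are stand-ins reduced to the fields used here, not yalafi's real definitions, and unlike the sketch above the error dictionary is copied before the offset is set and the length is taken from the mark; both are assumptions of this example.

```python
import copy


class TextToken:
    """Stand-in for yalafi's TextToken (only the fields needed here)."""
    def __init__(self, pos, txt, pos_fix=False):
        self.pos = pos
        self.txt = txt
        self.pos_fix = pos_fix


class ErrorToken(TextToken):
    """Toy error token: prints a mark and carries a message."""
    def __init__(self, pos, txt, rule_id, msg):
        super().__init__(pos, txt, pos_fix=True)
        self.error = {'offset': 0, 'length': len(txt),
                      'rule': {'id': rule_id}, 'message': msg}


def get_txt_pos_err(toks):
    txt, pos, errors = '', [], []
    for t in toks:
        if isinstance(t, ErrorToken):
            error = copy.deepcopy(t.error)  # leave the token untouched
            error['offset'] = len(txt)      # start of the mark in the plain text
            errors.append(error)
        txt += t.txt
        if t.pos_fix:
            pos += [t.pos] * len(t.txt)
        else:
            pos += list(range(t.pos, t.pos + len(t.txt)))
    return txt, pos, errors


toks = [
    TextToken(0, 'See '),
    ErrorToken(4, 'LATEXXXERROR', 'YY_ERROR', 'unknown macro'),
    TextToken(8, ' here.'),
]
txt, pos, errors = get_txt_pos_err(toks)
matches = []        # would normally hold the proofreader's matches
matches += errors   # append the filter's own messages
```

Every character of the mark maps back to source position 4, so a front end would highlight the offending LaTeX location rather than text that does not exist in the source.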

To also support errors from nested calls (i.e. \LTinput), I need to filter for type ErrorToken in h_load_defs as well.

So far, this obviously only works for certain settings (e.g. no multilanguage), but the changes should be straightforward.

EDIT: By the way, this would show you errors from nested \LTinput calls at the position of the topmost call. One could add the actual file path to the error message.
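One conceivable way to attach the file path, as a sketch only. The helper `tag_nested_errors`, its parameters, and the example path are hypothetical; how h_load_defs would actually obtain the errors and the call position is left open.

```python
def tag_nested_errors(errors, file_path, call_offset):
    """Return copies of the errors, moved to the call site and tagged with the file."""
    tagged = []
    for err in errors:
        new = dict(err)  # shallow copy; nested dictionaries are not modified
        new['offset'] = call_offset
        new['message'] = f"{err['message']} [in {file_path}]"
        tagged.append(new)
    return tagged


# Hypothetical error found while reading a file loaded via \LTinput.
inner = [{'offset': 17, 'length': 1, 'message': 'undefined macro'}]
outer = tag_nested_errors(inner, 'defs/macros.tex', call_offset=42)
```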

matze-dd · 2021-02-02T18:44:13Z

This is in principle the scheme I'm also thinking about. As you also point out, there are quite some things to consider, if one wants to achieve a solution of good quality. In part, this is also due to the "not so tidy" internal structure of the software.

So, if you need a quick solution for your application, then you are of course free to modify the tool for your needs. On the other hand, this is a hobby project, and I would like to "hack" the core parts on my own. (I see an effort of more than a few days. In other words, I'd like to slow down.)

By the way, 'token' is a technical term in compiler building. It is perhaps misplaced for the messages between the core LaTeX filter and an application like yalafi.shell.

torik42 · 2021-02-02T18:58:26Z

> This is in principle the scheme I'm also thinking about. As you also point out, there are quite some things to consider, if one wants to achieve a solution of good quality. In part, this is also due to the "not so tidy" internal structure of the software.

> So, if you need a quick solution for your application, then you are of course free to modify the tool for your needs. On the other hand, this is a hobby project, and I would like to "hack" the core parts on my own. (I see an effort of more than a few days. In other words, I'd like to slow down.)

Sure. I just don’t want to open issues and expect that you solve them. There is also really no need to hurry. Have fun hacking!

> By the way, 'token' is a technical term in compiler building. It is perhaps misplaced for the messages between the core LaTeX filter and an application like yalafi.shell.

I don’t know compiler building at all. But I called it ErrorToken because it’s a subclass of TextToken and you called all these …Token.

matze-dd added the enhancement New feature or request label Feb 1, 2021

matze-dd mentioned this issue Feb 1, 2021

Map LaTeX error to LanguageTool spelling mistake #177

Open

torik42 mentioned this issue Aug 2, 2022

Status of the project #226

Closed
