Stripping curly brackets is too greedy #180

pbelmans · 2017-09-18T13:22:32Z

In #158 a customization to strip { and } from fields was introduced. The problem with the greedy approach in the current implementation is that it will also replace curly brackets in mathematical expressions, and these are occasionally used in titles of papers.

The solution would be to replace curly brackets only in text mode, i.e. iterate over the string, keep track of whether you are in text or math mode, and strip curly brackets only while in text mode.

If you ignore the abstract field, then the only thing you need to worry about is $, or \( and \), for inline math. Inside the abstract field (and maybe similar fields, if there are any), anything can happen, and one is royally screwed in coming up with an implementation.
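A minimal sketch of the text/math tracking described above, not the library's implementation. It assumes only inline math delimited by $...$ or \(...\) occurs, that a backslash escapes the character after it, and that display math ($$...$$) and environments are absent. The function name strip_braces_in_text_mode is hypothetical.

```python
def strip_braces_in_text_mode(value):
    """Drop unescaped { and } outside inline math; copy math spans verbatim."""
    out = []
    closer = None  # None = text mode, otherwise the delimiter that ends math mode
    i = 0
    while i < len(value):
        pair = value[i:i + 2]
        if closer is None and pair == "\\(":
            closer = "\\)"  # entering \( ... \) math
            out.append(pair)
            i += 2
        elif closer == "\\)" and pair == "\\)":
            closer = None  # leaving \( ... \) math
            out.append(pair)
            i += 2
        elif value[i] == "\\" and len(pair) == 2:
            # Escaped character or start of a command (\$, \{, \}, \\, \alpha):
            # copy the backslash and the next character untouched.
            out.append(pair)
            i += 2
        elif value[i] == "$" and closer in (None, "$"):
            closer = "$" if closer is None else None  # toggle $ ... $ math
            out.append("$")
            i += 1
        elif closer is None and value[i] in "{}":
            i += 1  # text-mode brace: strip it
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


print(strip_braces_in_text_mode(r"{A} note on $\mathbb{P}^{1}$"))
```

Under these assumptions the braces around A are removed while those inside the math span survive. Commands with arguments in text mode, such as \url{...}, still lose their braces with this sketch, so it covers only the math-mode part of the problem.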


Phyks · 2017-09-18T14:20:03Z

Thanks for opening an issue on this!

I thought more about it and found a couple of related issues:

There could be an issue with things such as \url{http://example.com} which I sometimes saw in BibTeX entries. It would be translated into \url http://example.com with current code.
There might be an issue with the convert_to_unicode code also.
Also, title={An FFT Algorithm} will yield An FFT Algorithm whereas it is An Fft Algorithm in LaTeX. Not sure how expected this is.

I think we should try to list some example snippets for now, before actually dealing with this issue, to ensure the solution will cover these cases. I don't think we already have BibTeX snippets which cover these cases in our test base.
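As a starting point, the cases raised in this thread could be collected as plain input strings, roughly as below. This is only a sketch: the dictionary name, the case names, the note field, and the two snippets marked hypothetical are assumptions for illustration, and no expected outputs are given because the desired behaviour is still under discussion.

```python
# Candidate inputs only; the expected output for each is still open for discussion.
BRACE_STRIPPING_CASES = {
    # Braces inside inline math should survive (hypothetical title).
    "math_in_title": r"title = {A note on $\mathbb{P}^{1}$}",
    # Command with a braced argument (the note field is an assumption).
    "url_command": r"note = {\url{http://example.com}}",
    # Capitalisation example mentioned above.
    "fft_title": r"title={An FFT Algorithm}",
    # Escaped braces, cf. #193 (hypothetical title).
    "escaped_braces": r"title = {The set \{0, 1\}}",
}

for name, snippet in BRACE_STRIPPING_CASES.items():
    print(f"{name}: {snippet}")
```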

Concerning the solution itself, there is a debate to be had about whether the actual output of plaintext_* should be plaintext (which I think is the way to go for the use case in #116) or just LaTeX without curly braces. The first option is difficult to achieve, though :/

omangin · 2018-10-07T04:12:14Z

See also #193:

latex_to_unicode customization should preserve escaped braces
See https://github.com/sciunto-org/python-bibtexparser/blob/master/bibtexparser/latexenc.py#L70 and #187.

WouterJeuris · 2019-10-23T12:08:38Z

Hey. Just dropping in here as a non-python developer and non-LaTeX user. So this comment might be uninformed. But is it possible you're focussing too heavily on the "brackets"? To me this looks like a LaTeX2e encoded string, and in python there are very good packages available to convert those to text.

This one seems to be the most prominent one: https://pylatexenc.readthedocs.io/en/latest/latex2text/

I think I have implemented it successfully as a customization, somewhat as follows:

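from pylatexenc.latex2text import LatexNodes2Text

def latex_to_text(record):
    # Build the converter once and reuse it for every field.
    converter = LatexNodes2Text()
    # Only string values carry LaTeX markup; leave anything else unchanged.
    return {
        key: converter.latex_to_text(value) if isinstance(value, str) else value
        for key, value in record.items()
    }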

If the text contains what your documentation calls "accents and weird characters" it seems to imply that it's LaTeX encoded, and hence will contain a lot more weird stuff than just the brackets that are being focussed on here ...

Hope this is of any help!
Thx for the very useful toolbox!

pbelmans · 2019-10-23T12:15:10Z

Appealing to an external library is a good way of letting someone else deal with the special situations. But in any case you'll need to add math_mode='verbatim' as an option, otherwise the whole point of not stripping curly brackets in math mode is defeated.
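A sketch of the customization with that option set, assuming pylatexenc is installed. The function name plaintext_record and the optional converter parameter are hypothetical; the parameter only exists so that a different converter can be swapped in.

```python
def plaintext_record(record, converter=None):
    """Return a copy of `record` with LaTeX converted to text, math kept verbatim."""
    if converter is None:
        # Imported lazily so pylatexenc is only needed when no converter is passed.
        from pylatexenc.latex2text import LatexNodes2Text
        # math_mode="verbatim" copies math spans through as written,
        # so curly brackets inside $...$ are not touched.
        converter = LatexNodes2Text(math_mode="verbatim")
    return {
        key: converter.latex_to_text(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```

Calling plaintext_record({"title": r"{A} note on $\mathbb{P}^{1}$"}) should then leave the math part of the title as written, while markup outside math mode is converted by pylatexenc.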

MiWeiss · 2023-05-26T14:02:13Z

We're using pylatexenc as external library for now in v2. This may not be ideal (it's rather slow and not bibtex specific), but seems to be working well for all test cases reported so far.

If anyone wants to submit a fix for v1, I am happy to review it, but it seems to be a rather big change needed; it's probably easier to just migrate to v2.

omangin mentioned this issue Oct 7, 2018

latex_to_unicode customization should preserve escaped braces #193

Closed

omangin added the bug label Oct 7, 2018

MiWeiss closed this as completed May 26, 2023

MiWeiss added the fixed in v2 label May 26, 2023
