Stripping curly brackets is too greedy #180

Closed
pbelmans opened this issue Sep 18, 2017 · 5 comments

Comments

@pbelmans

In #158 a customization to strip { and } from fields was introduced. The problem with the greedy approach in the current implementation is that it will also replace curly brackets in mathematical expressions, and these are occasionally used in titles of papers.

The solution would be to replace curly brackets only in text mode, i.e. iterate over the string, keep track of whether we are in text or math mode, and strip curly brackets only while in text mode.

If you ignore the abstract field, then the only thing you need to worry about is $ or \( and \) for inline math. Inside the abstract field (and maybe similar fields if there are any), anything can happen, and one is royally screwed in coming up with an implementation.
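A minimal sketch of the mode-tracking idea described above, assuming only `$...$` and `\(...\)` delimit math. The function name `strip_text_mode_braces` is hypothetical and not part of bibtexparser; display math (`$$`), environments, and macros such as `\url{...}` are deliberately not handled.

```python
def strip_text_mode_braces(value: str) -> str:
    """Remove { and } outside inline math; leave math segments untouched."""
    out = []
    in_math = False  # True while inside $...$ or \( ... \)
    closer = ""      # delimiter that ends the current math segment
    i = 0
    while i < len(value):
        two = value[i:i + 2]
        ch = value[i]
        if two in (r"\$", r"\{", r"\}"):
            # Escaped dollar sign or brace: copy through unchanged.
            out.append(two)
            i += 2
        elif not in_math and two == r"\(":
            in_math, closer = True, r"\)"
            out.append(two)
            i += 2
        elif not in_math and ch == "$":
            in_math, closer = True, "$"
            out.append(ch)
            i += 1
        elif in_math and value.startswith(closer, i):
            in_math = False
            out.append(closer)
            i += len(closer)
        elif not in_math and ch in "{}":
            i += 1  # text-mode brace: drop it
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(strip_text_mode_braces(r"{On} $\mathbb{P}^{n}$-bundles"))
```

Escaped braces (`\{`, `\}`) are copied through unchanged in this sketch, which is a design choice rather than something the issue prescribes.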

@Phyks
Collaborator

Phyks commented Sep 18, 2017

Thanks for opening an issue on this!

I thought more about it and found a couple of related issues:

  • There could be an issue with things such as \url{http://example.com} which I sometimes saw in BibTeX entries. It would be translated into \url http://example.com with current code.
  • There might be an issue with the convert_to_unicode code also.
  • Also, title={An FFT Algorithm} will yield An FFT Algorithm whereas it is An Fft Algorithm in LaTeX. Not sure how expected this is.

I think we should try to list some example snippets for now, before actually dealing with this issue, to ensure the solution will cover these cases. I don't think we already have BibTeX snippets which cover these cases in our test base.
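As a starting point, such snippets could be collected roughly like this. This is only a sketch: `greedy_strip` is a naive stand-in for unconditional brace stripping, not the library's actual code (so details such as spacing may differ), and the field values are illustrative examples based on the cases mentioned in this thread.

```python
def greedy_strip(value: str) -> str:
    # Naive stand-in: drop every curly bracket, regardless of context.
    return value.replace("{", "").replace("}", "")


# Candidate inputs that a fix should be checked against (illustrative values).
cases = {
    "dollar math in title": r"Bundles over $\mathbb{P}^{n}$",
    "paren math in title": r"Bundles over \(\mathbb{P}^{n}\)",
    "url macro": r"\url{http://example.com}",
}

for name, raw in cases.items():
    # Show what unconditional stripping does to each input.
    print(f"{name}: {raw!r} -> {greedy_strip(raw)!r}")
```

Each entry could later become a test asserting the desired output once that output is agreed on.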

Concerning the solution itself, we still need to decide whether the output of plaintext_* should be actual plain text (which I think is the way to go for the use case in #116) or just LaTeX without curly braces. The first option is difficult to achieve, though :/

@omangin
Collaborator

omangin commented Oct 7, 2018

See also #193:

latex_to_unicode customization should preserve escaped braces
See https://github.com/sciunto-org/python-bibtexparser/blob/master/bibtexparser/latexenc.py#L70 and #187.

omangin added the bug label Oct 7, 2018
@WouterJeuris

Hey. Just dropping in here as a non-python developer and non-LaTeX user. So this comment might be uninformed. But is it possible you're focussing too heavily on the "brackets"? To me this looks like a LaTeX2e encoded string, and in python there are very good packages available to convert those to text.

This one seems to be the most prominent one: https://pylatexenc.readthedocs.io/en/latest/latex2text/

I think I have implemented it successfully as a customization, roughly as follows:

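from pylatexenc.latex2text import LatexNodes2Text

# Build the converter once instead of once per field.
_converter = LatexNodes2Text()

def latex_to_text(record):
    # Convert string fields from LaTeX to plain text; leave other values untouched.
    return {
        key: _converter.latex_to_text(value) if isinstance(value, str) else value
        for key, value in record.items()
    }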

If the text contains what your documentation calls "accents and weird characters" it seems to imply that it's LaTeX encoded, and hence will contain a lot more weird stuff than just the brackets that are being focussed on here ...

Hope this is of any help!
Thx for the very useful toolbox!

@pbelmans
Author

Appealing to an external library is a good way of letting someone else deal with the special situations. But in any case you'll need to add math_mode='verbatim' as an option, otherwise the whole point of not stripping curly brackets in math mode is defeated.
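A hedged sketch of what that option could look like in such a customization. `LatexNodes2Text` and its `math_mode` argument come from pylatexenc; the function name `latex_to_text_keep_math` is hypothetical, and the import is deferred only so the definition loads even where pylatexenc is not installed.

```python
_converter = None  # created lazily on first use


def latex_to_text_keep_math(record):
    """Convert string fields to plain text, keeping math segments verbatim."""
    global _converter
    if _converter is None:
        from pylatexenc.latex2text import LatexNodes2Text
        # math_mode="verbatim" asks pylatexenc to leave math as written.
        _converter = LatexNodes2Text(math_mode="verbatim")
    return {
        key: _converter.latex_to_text(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```

With this setting, text-mode markup is still converted while the contents of `$...$` are left alone, so curly brackets inside math survive.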

@MiWeiss
Collaborator

MiWeiss commented May 26, 2023

We're using pylatexenc as an external library for now in v2. This may not be ideal (it's rather slow and not BibTeX-specific), but it seems to be working well for all test cases reported so far.

If anyone wants to submit a fix for v1, I am happy to review it, but it seems to be a rather big change needed; it's probably easier to just migrate to v2.
