Stripping curly brackets is too greedy #180
Comments
Thanks for opening an issue on this! I thought more about it and found a couple of related issues:
I think we should list some example snippets first, before actually dealing with this issue, to ensure the solution covers these cases. I don't think our test base already has BibTeX snippets covering them. As for the solution itself, there is a debate to be had about whether the actual output of |
See also #193:
|
Hey. Just dropping in here as a non-python developer and non-LaTeX user. So this comment might be uninformed. But is it possible you're focussing too heavily on the "brackets"? To me this looks like a LaTeX2e encoded string, and in python there are very good packages available to convert those to text. This one seems to be the most prominent one: https://pylatexenc.readthedocs.io/en/latest/latex2text/ I think I have implemented it successfully as a customization somewhat as follows
If the text contains what your documentation calls "accents and weird characters" it seems to imply that it's LaTeX encoded, and hence will contain a lot more weird stuff than just the brackets that are being focussed on here ... Hope this is of any help! |
Appealing to an external library is a good way of letting someone else deal with the special situations. But in any case you'll need to add |
We're using pylatexenc as an external library for now in v2. This may not be ideal (it's rather slow and not BibTeX-specific), but it seems to work well for all test cases reported so far. If anyone wants to submit a fix for v1, I am happy to review it, but the change needed seems rather big; it's probably easier to just migrate to v2. |
In #158 a customization to strip `{` and `}` from fields was introduced. The problem with the greedy approach in the current implementation is that it will also replace curly brackets in mathematical expressions, and these are occasionally used in titles of papers.

The solution would be to replace curly brackets only in text mode, i.e. iterate over the string, keep track of whether the current position is in text or math mode, and replace curly brackets only in text mode.
If you ignore the `abstract` field, then the only thing you need to worry about is `$` or `\(` and `\)` for inline math. Inside the `abstract` field (and maybe similar fields if there are any), anything can happen, and one is royally screwed in coming up with an implementation.
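The text/math-mode tracking proposed in this issue could look roughly like the sketch below. It assumes that only `$...$` and `\(...\)` delimit math, as suggested for fields other than `abstract`. The function name `strip_text_braces` is hypothetical and not part of bibtexparser; display math (`$$...$$`, `\[...\]`) and escaped-brace policy are deliberately left out.

```python
def strip_text_braces(value: str) -> str:
    """Drop '{' and '}' in text mode, keep them inside $...$ and \\(...\\).

    Hypothetical helper for illustration; not an existing bibtexparser API.
    """
    out = []
    closer = None  # None in text mode, else the delimiter that ends math mode
    i = 0
    n = len(value)
    while i < n:
        ch = value[i]
        if ch == "\\" and i + 1 < n:
            two = value[i:i + 2]
            if closer is None and two == r"\(":
                closer = r"\)"      # enter math mode via \(
            elif closer == two:
                closer = None       # leave math mode via \)
            # Copy any backslash pair verbatim, so \$, \{ and \} survive
            # and an escaped dollar sign does not toggle math mode.
            out.append(two)
            i += 2
            continue
        if ch == "$":
            if closer is None:
                closer = "$"        # enter math mode via $
            elif closer == "$":
                closer = None       # leave math mode via $
            out.append(ch)
        elif ch in "{}" and closer is None:
            pass                    # text mode: strip the bracket
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_text_braces("{The} $x^{2}$ case"))
print(strip_text_braces(r"On \(a_{i}\) and {B}ib{T}e{X}"))
```

Tracking the expected closing delimiter, rather than a plain boolean, keeps a stray `$` inside `\(...\)` from ending math mode early. Anything more general, such as arbitrary LaTeX in `abstract`, is out of reach for a scanner this simple.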