[Bug]: missing field breaks import #2377

Closed
tplobo opened this issue Jan 13, 2023 · 27 comments

@tplobo

tplobo commented Jan 13, 2023

Debug log ID

J2GQXW4D-refs-euc

What happened?

The function "Import from Clipboard" stopped working for me. Currently using Zotero 6.0.19, and disabling BetterBibTeX makes the function work again.

I semi-automatically create items using this function and lines of text created automatically by the browser, such as the example below:

@techreport{TBD,  title = {Report number 2PTE3 - 3D transport model under operative regime},  , number = {2PTE3},  institution = {{Research Institute X}},  url = {https://website.org/?uid=2PTE3},  Accessed = {2023-01-13}}

Now, when attempting an import, Zotero creates a note instead with comments such as:
"
Import errors found:

  • line 1, column 127: Expected "%", Optional Horizontal Whitespace, Optional Whitespace, [\r\n], [_:a-zA-Z0-9-], or [})] but "," found.
    "

Just to be clear, the found error is not always the same.

I already tried selecting "no" in both "Sentence-case titles on import" and "Insert case-protection for braces", but it does not change the problem. I did notice, however, that after attempting an import the selector for "Sentence-case titles on import" automatically changes to "yes".

Any ideas how to remedy this?

@tplobo tplobo changed the title [Bug/Feature]: "Import from Clipboard" stopped working [Bug]: "Import from Clipboard" stopped working Jan 13, 2023
@retorquere
Owner

I can see if I can make the parser robust against this, but in the sample you posted above you have

regime},  , number

note the two commas, indicating a missing field.

@retorquere
Owner

That's just not valid BibTeX; both bibtex (renders with errors) and biblatex (does not render) complain about this input.

@retorquere
Owner

retorquere commented Jan 13, 2023

> I did notice, however, that after attempting an import the selector for "Sentence-case titles on import" automatically changes to "yes".

That should not happen. Looking into that. You do currently have it set to "yes, but try to exclude already-sentence-cased titles" though.

@tplobo
Author

tplobo commented Jan 13, 2023

Hi @retorquere, thanks for the reply.

The double comma was always there and it was never a problem in previous versions of BBT. It happens because the bibtex creation routine in the browser is not able to extract the author on some pages.

In any case, when BBT is disabled, it is not an issue for Zotero's "Import from Clipboard". Unfortunately I cannot tell you in which version this issue started, because I have only just noticed it; BBT must have updated several times before.

About the selector, I usually have the "yes, but try to exclude already-sentence-cased titles" option selected. I am just reporting what happened when I tried changing it to "no" to try and solve the "Import from Clipboard" issue.
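For reference, the double comma described above can be avoided at the source. The following is a minimal sketch, not the bookmarklet actually in use: `buildBibtexEntry` is a hypothetical helper that drops any field whose value is empty before joining, so a missing author never leaves an empty slot behind.

```javascript
// Hypothetical helper (not part of the original bookmarklet): build a BibTeX
// entry from a plain object, skipping fields whose value is empty so that
// no ", ," sequence is ever emitted.
function buildBibtexEntry(type, key, fields) {
  const lines = Object.entries(fields)
    .filter(([, value]) => value !== undefined && value !== null && String(value).trim() !== "")
    .map(([name, value]) => `  ${name} = {${value}}`);
  return `@${type}{${key},\n${lines.join(",\n")}\n}`;
}

const entry = buildBibtexEntry("techreport", "TBD", {
  title: "Report number 2PTE3 - 3D transport model under operative regime",
  author: "", // not found on the page, so it is left out of the entry
  number: "2PTE3",
  institution: "{Research Institute X}",
  url: "https://website.org/?uid=2PTE3",
});
console.log(entry);
```

Values are inserted as-is here; any LaTeX escaping of the title would still have to happen before the value is passed in.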

@retorquere
Owner

This part of the parser hasn't changed in ages, I'm not sure why it surfaced only recently for you, but I'll see what I can do.

@tplobo
Author

tplobo commented Jan 13, 2023

Thanks @retorquere, let me know if you need me to make any more tests.

retorquere added a commit that referenced this issue Jan 13, 2023
@github-actions

🤖 this is your friendly neighborhood build bot announcing test build 6.7.46.3637 ("fixes #2377")

Install in Zotero by downloading test build 6.7.46.3637, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

@retorquere retorquere changed the title [Bug]: "Import from Clipboard" stopped working [Bug]: missing field breaks import Jan 13, 2023
@tplobo
Author

tplobo commented Jan 13, 2023

Just tested test build 6.7.46.3637, but it did not solve the problem. I really don't think the issue is the empty field between commas, because Zotero and BBT were able to handle it previously.

Just for completeness, here is the error I received when trying to "Import from Clipboard" with this build.
"
Import errors found:

  • line 1, column 95: Expected "%", Optional Horizontal Whitespace, Optional Whitespace, [\r\n], [_:a-zA-Z0-9-], or [})] but "," found.

"

@retorquere
Owner

retorquere commented Jan 13, 2023

The issue really is the extra comma; that's precisely what the error message is saying (look at the character in position 95 on the first line). What I don't know yet is how come I can import the input from the clipboard with 3637 and you can't. I could reproduce the problem on the release version of BBT, so for me there is really a difference between the two.
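To check a generated entry before importing it, the position the parser complains about can be located with a few lines of JavaScript. This is only a rough sketch: `findEmptyFieldSlots` is a hypothetical helper, and it scans the raw text without tracking brace nesting, so a ", ," inside a field value would also be reported.

```javascript
// Report the 1-based line and column of every comma that directly follows
// another comma (ignoring whitespace), i.e. an empty field slot.
function findEmptyFieldSlots(source) {
  const hits = [];
  const pattern = /,(\s*),/g;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    // Index of the second comma, which is the character the parser rejects.
    const offset = match.index + 1 + match[1].length;
    const before = source.slice(0, offset);
    const line = before.split("\n").length;
    const column = offset - before.lastIndexOf("\n"); // 1-based column
    hits.push({ line, column });
    pattern.lastIndex = offset; // resume at the second comma so runs like ",,," are all reported
  }
  return hits;
}

console.log(findEmptyFieldSlots("@a{k, title = {x},  , number = {1}}"));
```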

@retorquere
Owner

Can you send a debug log from 3637?

@tplobo
Author

tplobo commented Jan 13, 2023

Just sent the debug log, ID 69NW7EPM-refs-euc.

Also, I tested removing one of the commas where it is doubled, and you are right that the problem does not repeat.
But I also noticed that BBT is formatting some of the text in the title field of the import: when the title contains underscores, they are replaced by <sub>/</sub> markup and part of the title turns into a subscript. This did not use to happen either.

@retorquere
Owner

Do you have a sample for that underscore issue?

@tplobo
Author

tplobo commented Jan 13, 2023

Here you go, already without the double commas:

@techreport {TBD,  title = {ABCD_D_2QCY77 - ReportCode: 3D transport model under operative regime},  number = {2QCY77},  institution = {{Research Institute X}},  url = {https://website.org/?uid=2QCY77},  Accessed = {2023-01-13}}

Do you need a gist?

@retorquere
Owner

retorquere commented Jan 13, 2023

The underscore interpretation is correct-ish: bare underscores are not valid LaTeX, and it looks like the stray underscore throws bibtex into math mode (where the underscore means subscript). Try compiling this MWE:

\documentclass[american]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=apa, backend=biber]{biblatex}

\usepackage{filecontents}
\begin{filecontents*}{my-collection.bib}
@techreport {TBD,  title = {ABCD_D_2QCY77 - ReportCode: 3D transport model under operative regime},  number = {2QCY77},  institution = {{Research Institute X}},  url = {https://website.org/?uid=2QCY77},  Accessed = {2023-01-13}}
\end{filecontents*}

\addbibresource{my-collection.bib}

\begin{document}

\cite{TBD}

\printbibliography

\end{document}

@retorquere
Owner

No need for a gist; the way you post them here, quoted, makes it easy for me to copy-paste them.

@tplobo
Author

tplobo commented Jan 13, 2023

I did and it is attached below, but maybe you should know I use Overleaf for compiling LaTeX, not my own computer...

TEST.pdf

@retorquere
Owner

> and makes a number into subscript. This did not use to happen either.

I really have no idea how that came about for you, because this has been BBT's behavior going years back.

@retorquere
Owner

I use Overleaf too and I get the exact same output. Note the missing spaces and underscores, which is because LaTeX is in math mode.

@retorquere
Owner

A new build is underway that adds logging. When you have that installed, please import the double-comma sample and send a new debug log.

@tplobo
Author

tplobo commented Jan 13, 2023

Maybe Zotero changed something in the app, that now forces BBT to interpret the import differently...

I could try to update/fix my HTML2BIBTEX routine in the browser, if you tell me what exactly should be the formatting that won't cause problems with the importing function...

The only issue is that I can't really change the document titles and such... So the underscores will keep existing.
But that is not what concerns me. What I want is an easy import of most of the document metadata, the report codes are not my priority. If the import goes back to working, I can survive for the time being.

@retorquere
Owner

retorquere commented Jan 13, 2023

> Maybe Zotero changed something in the app, that now forces BBT to interpret the import differently...

Unlikely, it's just bare text, and I'm also on 6.0.19.

> I could try to update/fix my HTML2BIBTEX routine in the browser, if you tell me what exactly should be the formatting that won't cause problems with the importing function...

I still want to fix the double-comma issue, the new log should help there. WRT the underscores, this table lists all characters that must be escaped:

{
  "#": {"math": "\\#", "text": "\\#"},
  "$": {"math": "\\$", "text": "\\$"},
  "%": {"math": "\\%", "text": "\\%"},
  "&": {"math": "\\&", "text": "\\&"},
  "/\u200b": {"text": "\\slash"},
  "<": {"math": "<"},
  ">": {"math": ">"},
  "\\": {"math": "\\backslash", "text": "\\textbackslash"},
  "^": {"math": "\\sphat", "text": "\\^"},
  "_": {"math": "\\_", "text": "\\_"},
  "{": {"math": "\\lbrace", "text": "\\{"},
  "}": {"math": "\\rbrace", "text": "\\}"},
  "~": {"math": "\\sptilde", "text": "\\textasciitilde"},
}

> The only issue is that I can't really change the document titles and such... So the underscores will keep existing.

That is unambiguously invalid latex though. And escaping these isn't really hard. If you want I can write a few lines of javascript that will do it for you.
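As a rough illustration of what such escaping could look like, here is a minimal sketch that covers only the text-mode column of the table above. The function name `escapeLatexText` is made up for this example, and the `{}` after `\textbackslash`, `\^` and `\textasciitilde` is an addition of mine (not in the table) so that a following letter is not read as part of the command or placed under the accent. The `<`, `>` and `/` + zero-width-space rows are left out because the table gives no plain text-mode replacement for the first two and the third is a special case.

```javascript
// Text-mode replacements taken from the table; the trailing "{}" on the
// control words is an assumption added for safety, not part of the table.
const TEXT_ESCAPES = {
  "#": "\\#",
  "$": "\\$",
  "%": "\\%",
  "&": "\\&",
  "\\": "\\textbackslash{}",
  "^": "\\^{}",
  "_": "\\_",
  "{": "\\{",
  "}": "\\}",
  "~": "\\textasciitilde{}",
};

// Replace every special character in one pass, so replacements that
// themselves contain backslashes or braces are never escaped a second time.
function escapeLatexText(text) {
  return text.replace(/[#$%&\\^_{}~]/g, (ch) => TEXT_ESCAPES[ch]);
}

console.log(escapeLatexText("ABCD_D_2QCY77 - ReportCode: 3D transport model"));
```

Doing the replacement in a single `replace` call matters: chaining one `replace` per character would re-escape the backslashes introduced by earlier steps.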

> But that is not what concerns me. What I want is an easy import of most of the document metadata, the report codes are not my priority. If the import goes back to working, I can survive for the time being.

I'd really prefer to fix it though. It really bugs me that I have it fixed locally but not for you.

@github-actions

🤖 this is your friendly neighborhood build bot announcing test build 6.7.46.3638 ("add logging")

Install in Zotero by downloading test build 6.7.46.3638, opening the Zotero "Tools" menu, selecting "Add-ons", open the gear menu in the top right, and select "Install Add-on From File...".

@tplobo
Author

tplobo commented Jan 13, 2023

First, I don't know what you changed in this last version, but it works well now. The problem that did not allow for the double comma disappeared. I am able to import again. I've sent you the log in any case, ID: 2FVFTJWG-refs-euc.

> The double comma was always there and it was never a problem in previous versions of BBT. It happens because the bibtex creation routine in the browser is not able to extract the author on some pages.

At the same time, as expected, the title issue remains. As I'd written up here, I use (an ad hoc modification of) this HTML2BIBLATEX snippet. Here's the bookmark I have saved in my browser, which unfortunately has to be a one-liner:

javascript:"use strict";(function(){var _document$querySelect,_document$querySelect2,_document$querySelect3;function copyToClipboard(text){window.prompt("Copy to clipboard: Ctrl+C, Enter",text)}function jsDate2bibTex(date){var dd=date.getDate();var mm=date.getMonth()+1;var yyyy=date.getFullYear();if(dd<10){dd="0"+dd}if(mm<10){mm="0"+mm}return yyyy+"-"+mm+"-"+dd}function date2YearTex(date){var yyyy=date.getFullYear();return""+yyyy}var title=document.title;var url=document.URL.replace('default.aspx','');var number=url.replace('https://idm.euro-fusion.org/?uid=','');var author_tag=document.querySelector("[name=author]");var author=author_tag==null?"":author_tag.content;var today=new Date;var urldate=jsDate2bibTex(today);var publishedTime=(_document$querySelect=document.querySelector('meta[property="article:published_time"'))===null||_document$querySelect===void 0?void 0:_document$querySelect.getAttribute("content");var someTimeTag=(_document$querySelect2=document.querySelector("time[datetime]"))===null||_document$querySelect2===void 0?void 0:_document$querySelect2.getAttribute("datetime");var someTimeTagWithoutDatetime=(_document$querySelect3=document.querySelector("time"))===null||_document$querySelect3===void 0?void 0:_document$querySelect3.innerHTML;var lastModifiedTime=document.lastModified;var pageTime=new Date(publishedTime||someTimeTag||someTimeTagWithoutDatetime||lastModifiedTime);var date=jsDate2bibTex(pageTime);var year=date2YearTex(pageTime);var title_key=title.replace(/[^0-9a-z]/gi,"");var citationKey='TBD';var type="@techreport";var filename=":./references/"+window.location.pathname.slice(1).replace(/\//g,"-")+".html:html";var title_tex=title.replace(/\u00e4/g,'\\"a').replace(/\u00c4/g,'\\"A').replace(/\u00f6/g,'\\"o').replace(/\u00d6/g,'\\"O').replace(/\u00fc/g,'\\"u').replace(/\u00dc/g,'\\"U').replace(/\u00DF/g,'\\"s');var bibTexEntry=type+" {"+citationKey+",\r  title = {"+title_tex+"},\r "+(author?"  author = {"+author+"},\r":"")+" ,\r number = {"+number+"},\r  institution = {{Research Institute X}},\r  url = {"+url+"},\r  Accessed = {"+urldate+"}\r}";copyToClipboard(bibTexEntry)})();

If it isn't much trouble to give me a solution for the underscores, I'd appreciate it. If it's too much effort, really never mind.

@retorquere
Owner

javascript:"use strict";(function(){const mapping={'#':'\\#','$':'\\$','%':'\\%','&':'\\&','/':'\\slash','<':'$<$','>':'$>$','\\':'\\textbackslash','^':'\\^',_:'\\_','{':'\\{','}':'\\}','~':'\\textasciitilde'};function text2latex(t){return t.replace(/./g,c=>mapping[c]||c)}var _document$querySelect,_document$querySelect2,_document$querySelect3;function copyToClipboard(text){window.prompt("Copy to clipboard: Ctrl+C, Enter",text)}function jsDate2bibTex(date){var dd=date.getDate();var mm=date.getMonth()+1;var yyyy=date.getFullYear();if(dd<10){dd="0"+dd}if(mm<10){mm="0"+mm}return yyyy+"-"+mm+"-"+dd}function date2YearTex(date){var yyyy=date.getFullYear();return ""+yyyy}var title=document.title;var url=document.URL.replace('default.aspx','');var number=url.replace('https://idm.euro-fusion.org/?uid=','');var author_tag=document.querySelector("[name=author]");var author=author_tag==null?"":author_tag.content;var today=new Date;var urldate=jsDate2bibTex(today);var publishedTime=(_document$querySelect=document.querySelector('meta[property="article:published_time"'))===null||_document$querySelect===void 0?void 0:_document$querySelect.getAttribute("content");var someTimeTag=(_document$querySelect2=document.querySelector("time[datetime]"))===null||_document$querySelect2===void 0?void 0:_document$querySelect2.getAttribute("datetime");var someTimeTagWithoutDatetime=(_document$querySelect3=document.querySelector("time"))===null||_document$querySelect3===void 0?void 0:_document$querySelect3.innerHTML;var lastModifiedTime=document.lastModified;var pageTime=new Date(publishedTime||someTimeTag||someTimeTagWithoutDatetime||lastModifiedTime);var date=jsDate2bibTex(pageTime);var year=date2YearTex(pageTime);var title_key=title.replace(/[^0-9a-z]/gi,"");var citationKey='TBD';var type="@techreport";var filename=":./references/"+window.location.pathname.slice(1).replace(/\//g,"-")+".html:html";var bibTexEntry=type+" {"+citationKey+",\n  title = {"+text2latex(title_tex)+"},\n "+(author?"  author = {"+author+"},\n":"")+" ,\n number = {"+number+"},\n  institution = {{Research Institute X}},\n  url = {"+url+"},\n  Accessed = {"+urldate+"}\n}";copyToClipboard(bibTexEntry)})();

should do it.

There's a few unused variables in that script. I've not touched them, but I don't see how they could be doing anything.

@retorquere
Owner

> First, I don't know what you changed in this last version

Literally only added a single line of logging:

-      if (err.source) item.note += `<pre>${escape.html(err.source)}</pre>`
+      if (err.source) {
+        item.note += `<pre>${escape.html(err.source)}</pre>`
+        Zotero.debug(`import error: ${err.message}\n>>>\n${err.source}\n<<<`)
+      }

@tplobo
Author

tplobo commented Jan 16, 2023

Hi @retorquere!

> literally only added a single line of logging

First, about the modifications in the last version. I don't know what happened, but I did have to restart Zotero twice more before sending you the latest logs with build 3638. So maybe the single Zotero restart with build 3637 was not enough for some reason. Cached settings, maybe? Don't know, but it is working just fine now, thanks again.

> should do it.
> There's a few unused variables in that script. I've not touched them, but I don't see how they could be doing anything.

Second, about the bookmarklet modifications: thanks a lot for the help! I tried to use your improved version, but it was missing the var title_tex definition command. I've fixed it and now it works perfectly! Thanks again.

I'll copy it down here for completeness:

javascript:"use strict";(function(){const mapping={'#':'\\#','$':'\\$','%':'\\%','&':'\\&','/':'\\slash','<':'$<$','>':'$>$','\\':'\\textbackslash','^':'\\^',_:'\\_','{':'\\{','}':'\\}','~':'\\textasciitilde'};function text2latex(t){return t.replace(/./g,c=>mapping[c]||c)}var _document$querySelect,_document$querySelect2,_document$querySelect3;function copyToClipboard(text){window.prompt("Copy to clipboard: Ctrl+C, Enter",text)}function jsDate2bibTex(date){var dd=date.getDate();var mm=date.getMonth()+1;var yyyy=date.getFullYear();if(dd<10){dd="0"+dd}if(mm<10){mm="0"+mm}return yyyy+"-"+mm+"-"+dd}function date2YearTex(date){var yyyy=date.getFullYear();return ""+yyyy}var title=document.title;var url=document.URL.replace('default.aspx','');var number=url.replace('https://website.org/?uid=','');var author_tag=document.querySelector("[name=author]");var author=author_tag==null?"":author_tag.content;var today=new Date;var urldate=jsDate2bibTex(today);var publishedTime=(_document$querySelect=document.querySelector('meta[property="article:published_time"'))===null||_document$querySelect===void 0?void 0:_document$querySelect.getAttribute("content");var someTimeTag=(_document$querySelect2=document.querySelector("time[datetime]"))===null||_document$querySelect2===void 0?void 0:_document$querySelect2.getAttribute("datetime");var someTimeTagWithoutDatetime=(_document$querySelect3=document.querySelector("time"))===null||_document$querySelect3===void 0?void 0:_document$querySelect3.innerHTML;var lastModifiedTime=document.lastModified;var pageTime=new Date(publishedTime||someTimeTag||someTimeTagWithoutDatetime||lastModifiedTime);var date=jsDate2bibTex(pageTime);var year=date2YearTex(pageTime);var title_key=title.replace(/[^0-9a-z]/gi,"");var citationKey='TBD';var type="@techreport";var filename=":./references/"+window.location.pathname.slice(1).replace(/\//g,"-")+".html:html";var title_tex=title.replace(/\u00e4/g,'\\"a').replace(/\u00c4/g,'\\"A').replace(/\u00f6/g,'\\"o').replace(/\u00d6/g,'\\"O').replace(/\u00fc/g,'\\"u').replace(/\u00dc/g,'\\"U').replace(/\u00DF/g,'\\"s');var bibTexEntry=type+" {"+citationKey+",\n  title = {"+text2latex(title_tex)+"},\n "+(author?"  author = {"+author+"},\n":"")+" ,\n number = {"+number+"},\n  institution = {{Research Institute X}},\n  url = {"+url+"},\n  Accessed = {"+urldate+"}\n}";copyToClipboard(bibTexEntry)})();

Unfortunately, there is one problem still with the "Import from Clipboard" function. The field "Accessed" gets written as tex.accessed in Extra, instead of filling the correct field.
I tried modifying the bookmarklet to write AccessDate, accessDate and access, but the date is still in the wrong field.
JFYI: the field "Accessed" exists for the techreport item type, so the import function should be able to locate it, no?

If you would prefer me to open a second issue instead, let me know and I'll transfer this last bit.

Thanks again,

@github-actions github-actions bot reopened this Jan 16, 2023
@retorquere
Owner

> Second, about the bookmarklet modifications: thanks a lot for the help! I tried to use your improved version, but it was missing the var title_tex definition command. I've fixed it and now it works perfectly! Thanks again.

I don't think that would work: you're applying my text2latex after doing your own TeX conversions, so the backslashes that end up in title_tex would themselves be escaped. title_tex should be removed entirely.

javascript:"use strict";(function(){const mapping={'#':'\\#','$':'\\$','%':'\\%','&':'\\&','/':'\\slash','<':'$<$','>':'$>$','\\':'\\textbackslash','^':'\\^',_:'\\_','{':'\\{','}':'\\}','~':'\\textasciitilde'};function text2latex(t){return t.replace(/./g,c=>mapping[c]||c)}var _document$querySelect,_document$querySelect2,_document$querySelect3;function copyToClipboard(text){window.prompt("Copy to clipboard: Ctrl+C, Enter",text)}function jsDate2bibTex(date){var dd=date.getDate();var mm=date.getMonth()+1;var yyyy=date.getFullYear();if(dd<10){dd="0"+dd}if(mm<10){mm="0"+mm}return yyyy+"-"+mm+"-"+dd}function date2YearTex(date){var yyyy=date.getFullYear();return ""+yyyy}var title=document.title;var url=document.URL.replace('default.aspx','');var number=url.replace('https://website.org/?uid=','');var author_tag=document.querySelector("[name=author]");var author=author_tag==null?"":author_tag.content;var today=new Date;var urldate=jsDate2bibTex(today);var publishedTime=(_document$querySelect=document.querySelector('meta[property="article:published_time"'))===null||_document$querySelect===void 0?void 0:_document$querySelect.getAttribute("content");var someTimeTag=(_document$querySelect2=document.querySelector("time[datetime]"))===null||_document$querySelect2===void 0?void 0:_document$querySelect2.getAttribute("datetime");var someTimeTagWithoutDatetime=(_document$querySelect3=document.querySelector("time"))===null||_document$querySelect3===void 0?void 0:_document$querySelect3.innerHTML;var lastModifiedTime=document.lastModified;var pageTime=new Date(publishedTime||someTimeTag||someTimeTagWithoutDatetime||lastModifiedTime);var date=jsDate2bibTex(pageTime);var year=date2YearTex(pageTime);var title_key=title.replace(/[^0-9a-z]/gi,"");var citationKey='TBD';var type="@techreport";var filename=":./references/"+window.location.pathname.slice(1).replace(/\//g,"-")+".html:html";var bibTexEntry=type+" {"+citationKey+",\n  title = {"+text2latex(title)+"},\n "+(author?"  author = {"+author+"},\n":"")+" ,\n number = {"+number+"},\n  institution = {{Research Institute X}},\n  url = {"+url+"},\n  Accessed = {"+urldate+"}\n}";copyToClipboard(bibTexEntry)})();
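The double-escaping concern can be reproduced in isolation. The sketch below is illustrative only: it uses a reduced copy of the bookmarklet's `mapping`, and `umlautsToTex` is a hypothetical stand-in for the chained `replace` calls that produced `title_tex`. It shows that the order of the two steps decides whether the backslash introduced for an umlaut is escaped again.

```javascript
// Reduced copy of the bookmarklet's mapping, enough to show the effect.
const mapping = { "\\": "\\textbackslash", "_": "\\_", "%": "\\%" };
function text2latex(t) {
  return t.replace(/./g, (c) => mapping[c] || c);
}

// Hypothetical stand-in for the umlaut replacements behind title_tex.
const umlauts = {
  "\u00e4": '\\"a', "\u00f6": '\\"o', "\u00fc": '\\"u',
  "\u00c4": '\\"A', "\u00d6": '\\"O', "\u00dc": '\\"U',
};
function umlautsToTex(t) {
  return t.replace(/[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc]/g, (c) => umlauts[c]);
}

const title = "\u00dcberblick_2QCY77"; // "Überblick_2QCY77"

// Umlauts first, then escaping: the backslash from \"U gets escaped again.
const escapedTwice = text2latex(umlautsToTex(title));
// Escaping first, then umlauts: each character is converted exactly once.
const escapedOnce = umlautsToTex(text2latex(title));

console.log(escapedTwice);
console.log(escapedOnce);
```

Dropping the umlaut step altogether, as in the bookmarklet above, sidesteps the ordering question entirely.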

> Unfortunately, there is one problem still with the "Import from Clipboard" function.

"Import from Clipboard" is not the problem, BTW. It's just an import error; import from clipboard just calls the importer. The same problem would occur if the same content were imported from a file.

> The field "Accessed" gets written as tex.accessed in Extra, instead of filling the correct field.

Translators cannot set the accessDate field directly. After import, you can right-click the item and select "Better BibTeX" -> "Copy date-added..."

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2023