BioRxiv: Use attachment title that conveys that file is a preprint #3137

Open
dstillman opened this issue Sep 16, 2023 · 7 comments · May be fixed by #3146

@dstillman
Member

https://forums.zotero.org/discussion/comment/443647/#Comment_443647

@zoe-translates zoe-translates self-assigned this Sep 16, 2023
@zoe-translates
Collaborator

I think there may be a case for a dedicated bioRxiv translator, if only to take advantage of its JSON API, which has lower overhead when saving multiple items in bulk. But then again, the current generic translator (HighWire 2.0, which in turn uses EM, the Embedded Metadata translator) already works; it's just the title of the attached PDF file that's unsatisfactory.

One thing we can do to possibly improve the baseline situation is to add a predicate on the itemType here:

newItem.attachments.push({ title: "Full Text PDF", url: pdfURL, mimeType: "application/pdf" });

If itemType is "preprint", we get a title string reflecting this. This could be a reasonable default, but it could be foiled by a preprint hosting service that puts a link to an external, non-preprint VoR (version of record) PDF in the metadata when it can find one. That is conceivable, but I'm not aware of any real-world instances.
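A minimal sketch of the predicate described above, for illustration only. The helper names `pdfAttachmentTitle` and `makePDFAttachment` are hypothetical and not part of any translator, and the "Preprint PDF" string is an assumed title rather than one taken from the existing code.

```javascript
// Hypothetical helper: choose the attachment title from the item type.
// "Preprint PDF" is an assumed string; "Full Text PDF" is the current title.
function pdfAttachmentTitle(itemType) {
	return itemType === "preprint" ? "Preprint PDF" : "Full Text PDF";
}

// Hypothetical helper: build the attachment object with the chosen title.
function makePDFAttachment(itemType, pdfURL) {
	return {
		title: pdfAttachmentTitle(itemType),
		url: pdfURL,
		mimeType: "application/pdf"
	};
}
```

With helpers like these, the push shown above would become `newItem.attachments.push(makePDFAttachment(newItem.itemType, pdfURL));`, assuming the item type is already known at that point.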

In any case, if we want to get a title string that has "bioRxiv" in it (in the manner of arXiv), perhaps a new bioRxiv translator is warranted?

@dstillman
Member Author

Ah, I didn't realize we didn't have a dedicated translator here.

> If itemType is "preprint", we get a title string reflecting this.

This is probably reasonable. We don't actually need "bioRxiv" in the name; "Preprint PDF" is fine. (That said, if we could tell a "Submitted Version" from an "Accepted Version" using the site metadata, that would be an argument in favor of a dedicated translator, since we use those terms for automatically fetched OA PDFs.)

@dstillman
Member Author

Or we could only say "Preprint PDF" when the file is hosted on the same domain? Is that the case for all the main preprint servers?
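A rough sketch of such a same-domain check, under stated assumptions: the function name `isSameHost` is hypothetical, the comparison is an exact hostname match (subdomains count as different hosts), and the URLs in the usage note are made up for illustration.

```javascript
// Hypothetical helper: true when the PDF is served from the same hostname
// as the page being saved. Relative PDF URLs are resolved against the page.
function isSameHost(pageURL, pdfURL) {
	try {
		const page = new URL(pageURL);
		const pdf = new URL(pdfURL, pageURL);
		return pdf.hostname === page.hostname;
	}
	catch (e) {
		// An unparseable URL gives no evidence of a same-host file.
		return false;
	}
}
```

For example, `isSameHost("https://preprints.example.org/abc", "/abc.full.pdf")` is true, while a PDF link pointing to `https://files.example.net/abc.pdf` is not. A server that hosts its files on a separate domain would therefore get the generic title under this check.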

@adam3smith
Collaborator

Check OSF-based servers. I think they might host the files on an OSF domain even when that isn't the preprint server's domain. Otherwise, I believe so.

@zoe-translates
Collaborator

Another feature requiring a dedicated translator is the automatic download of supplementary files (which will require page scraping; I don't think the API tells us anything about supplements but I'll verify).

@zoe-translates
Collaborator

As for the hypothetical issue of a metadata PDF link on a preprint-hosting page pointing to a non-preprint file, I no longer think OSF sites would be an issue. It seems much more likely that a preprint service would link the preprint to an external VoR by a permalink or identifier (as arXiv and OSF do with a DOI link) rather than link to a file in a specific format. TL;DR: it's too hypothetical.

@zoe-translates
Collaborator

In fact, the HighWire 2.0 translator has this:

translators/HighWire 2.0.js

Lines 294 to 295 in 8e5c648

if (item.publicationTitle.endsWith('Rxiv')) {
	item.itemType = preprintType;

So the HighWire 2.0 translator's ability to handle bioRxiv/medRxiv is something of a hack. Even with the hack, however, the EM sub-translator still can't see that the type is preprint, so it can't take advantage of automatic naming based on the preprint type.

This problem is general: there's currently no good way to detect the preprint type from HW metadata in the EM translator, and any fix in EM would be special-case code (i.e. domain- or URL-based allowlisting) that belongs in the HW translator anyway. It's made worse because EM treats the HW-derived type as highly accurate.

So in EM, I adjusted the priority of type determination with respect to the HW-derived type: when we can already identify the item as a preprint by other means, HW doesn't override it.

This makes it possible for the HW translator to explicitly pass the itemType property to EM (as exports.itemType). This is what I did in #3146.
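The priority rule described above could be sketched roughly as follows. This is not the code from EM or from #3146; the function `resolveItemType`, its parameters, and the "webpage" fallback are all hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch of the type-priority rule:
//   knownType    - a type already identified by other means
//                  (e.g. passed in explicitly by the calling translator)
//   hwType       - the type derived from HW metadata
//   fallbackType - assumed default when nothing else is available
function resolveItemType(knownType, hwType, fallbackType = "webpage") {
	// An item already identified as a preprint keeps that type;
	// the HW-derived type is not allowed to override it.
	if (knownType === "preprint") {
		return knownType;
	}
	// Otherwise the HW-derived type keeps its high priority.
	return hwType || knownType || fallbackType;
}
```

Under this sketch, `resolveItemType("preprint", "journalArticle")` stays a preprint, while an item with no previously known type still takes whatever type HW reports.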
