Embedded Metadata and HighWire fixes for preprint type (#3137) #3146

Draft: wants to merge 2 commits into base: master


zoe-translates
Collaborator

  • In Embedded Metadata (EM), the HighWire (HW)-determined item type is
    usually preferred. However, HW is not known to distinguish preprints
    from (published) articles. In fact, the HW translator handles
    preprints manually for special cases (bioRxiv/medRxiv). Therefore, in
    EM, if the type determined by non-HW means is already "preprint",
    don't let HW override it. In particular, this keeps exports.itemType
    respected (e.g. when set by a translator that calls EM).
  • In EM translator, if we have determined the type to be "preprint", use
    "Preprint PDF" as the PDF attachment name, rather than "Full Text
    PDF".
  • In the HW2.0 translator, make the bioRxiv/medRxiv special-case code a
    bit easier to maintain, by
    1. making an explicit "isBioMedRxiv()" function,
    2. avoiding duplicated conditionals testing bioRxiv/medRxiv,
    3. explicitly passing the detected itemType to the EM translator.
  • In HW2.0, for bioRxiv/medRxiv, delete the "pages" field, which is
    almost always an artifact arising from malformed HW metadata. This
    prevents it from going into the Extra field.
  • Make the HW2.0 scrape code async.
  • Update HW2.0 test cases.
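
As a rough illustration of the rules listed above, here is a minimal, standalone JavaScript sketch. It is not the code from this PR: the helper names `chooseItemType`, `pdfAttachmentTitle`, and `cleanBioMedRxivItem` are hypothetical, and the hostname-based body of `isBioMedRxiv` is an assumption (the description only names the function, not how it tests the site).

```javascript
// Hypothetical helpers illustrating the described behavior; they are
// not the translators' real internals.

// The HighWire-derived type normally wins, but a "preprint" type found
// by non-HW means (or preset by a calling translator) is kept.
function chooseItemType(hwType, otherType) {
  if (otherType === "preprint") return "preprint";
  return hwType || otherType;
}

// Name the PDF attachment according to the final item type.
function pdfAttachmentTitle(itemType) {
  return itemType === "preprint" ? "Preprint PDF" : "Full Text PDF";
}

// Assumption: the site is recognized by hostname. The real check may differ.
function isBioMedRxiv(url) {
  const { hostname } = new URL(url);
  return /(^|\.)(biorxiv|medrxiv)\.org$/i.test(hostname);
}

// Return a copy of the item without "pages", which on these sites is
// described as an artifact of malformed HW metadata.
function cleanBioMedRxivItem(item) {
  const cleaned = { ...item };
  delete cleaned.pages;
  return cleaned;
}

// Usage: a non-HW "preprint" is not overridden by HW's "journalArticle".
const finalType = chooseItemType("journalArticle", "preprint");
const attachmentTitle = pdfAttachmentTitle(finalType);
```

Keeping the site test in one function means the special-case branches share a single condition, which is the maintainability point made in the list.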

In addition, the failure of HW2.0 to detect "multiple" is addressed by improving the fallback-to-multiple logic and fixing the argument list of the call to getSearchResults().
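
To make the fallback idea concrete, the sketch below follows the common Zotero translator pattern in which `getSearchResults` takes a check-only flag. It is an assumption-laden illustration, not the HW2.0 code: the real function takes a document and its actual argument list is not shown in this PR description, so a plain array of `{ href, title }` objects stands in for the page, and `detectType` is a hypothetical name.

```javascript
// Hypothetical stand-in for a translator's getSearchResults(): the real
// one inspects a DOM document; here a plain array of links is used.
function getSearchResults(links, checkOnly) {
  const items = {};
  let found = false;
  for (const { href, title } of links) {
    if (!href || !title) continue;
    // In check-only mode, one usable result is enough to answer "yes".
    if (checkOnly) return true;
    found = true;
    items[href] = title;
  }
  return found ? items : false;
}

// Hypothetical detection step: prefer a single-item type, and fall back
// to "multiple" only when the page actually lists usable results.
function detectType(singleItemType, links) {
  if (singleItemType) return singleItemType;
  if (getSearchResults(links, true)) return "multiple";
  return false;
}
```

With a positional flag like this, passing arguments in the wrong order or omitting one can silently change what the call returns, which is one way a "multiple" detection can go wrong.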

Fixes #3137

@zoe-translates
Collaborator Author

I need to give this a bit more thought. I tried to convert this to a draft, but the browser/GitHub isn't letting me do so.

@dstillman marked this pull request as draft on September 26, 2023, 11:37
@AbeJellinek
Member

> if we have determined the type to be "preprint", use "Preprint PDF" as the PDF attachment name, rather than "Full Text PDF".

If we do this, we should update arXiv, Preprints.org, etc., as well.

@zoe-translates
Collaborator Author

Well, I guess "full text" (or "full-text" as an adjective) simply means that: it's a text in its entirety, as opposed to an abstract, a summary, or an abridgement. So we may even say "full-text preprint" which means a preprint as a whole (for instance, "medRxiv launches full-text HTML of preprints online"). The NASA ADS lists both the VoR and the corresponding arXiv preprint PDF files under "full text sources" e.g. on the upper right of this page: 2010ApJ...725.2324B

So on second thought, I think while useful, the distinction of "preprint PDF" vs. "full-text PDF" isn't that clear-cut. It's just that "full text PDF" is less specific when the file is indeed a preprint.

Successfully merging this pull request may close these issues.

BioRxiv: Use attachment title that conveys that file is a preprint