[don't merge] Switch DOI to DOI.org #1135
Conversation
@zuphilip -- this obviously needs cleanup, but I wanted to get your thoughts on this. It's taking advantage of the now-available CSL JSON via content negotiation on doi.org (and the functionality you added to ZU.doGet to use it). The main advantages are:
1) Fewer API requests
2) Works in the Chrome connector for items with DataCite and CrossRef DOIs
3) Will work with additional services as they become available
Open questions:
1) What's a good name for the new translator?
2) Should we remove the old DataCite and CrossRef translators, or leave them in as a fallback, at least initially?
3) Any thoughts on how to test this for data quality? I was going to manually go through all DOIs that are in the CrossRef translator. Any other ideas?
As well as any other conceptual issues you can think of (as I say, don't worry about code quality for now).
Cool! I also wanted to look at this anyway at some point... At
I like DOI Content Negotiation. 👍 It is a search translator and can thereby also be used for the DOI search from Zotero itself, i.e. the "magic wand". I don't know if this affects the discussion here.
If everything works out fine, then we should not (strictly) need the single search translators anymore. However, this feature is only available from Zotero
You can create a gist with the examples and then test the web translator (which will implicitly test the search translator), e.g. https://gist.github.com/zuphilip/9e5101219f453964eb7e . It would also be good to have more information for users when translation fails, i.e. the things we see at https://repo.zotero.org/errors. Did you test this translator with the Chrome connector already? Let me know if I can help you further.
This now works decently well
To your questions: More generally, in doing some quality testing, I'm finding that metadata for books and book chapters in CrossRef is of much lower quality in the CSL JSON than in their UnixRef format. I have contacted them about this and will see what they say. If improvements are not in the cards soon, we can still handle all of this in a single translator using content negotiation: we can specify an order of preferred formats in the Accept header, put UnixRef and DataCite XML in there, then check which one we get (which should be easy) and run the respective parsing code.
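The fallback approach described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual translator code: the MIME types are the ones doi.org content negotiation advertises, but the parser names are placeholders.

```javascript
// Formats in order of preference; CSL JSON is the fallback of last resort.
const PREFERRED_FORMATS = [
	"application/vnd.crossref.unixref+xml",    // CrossRef UnixRef
	"application/vnd.datacite.datacite+xml",   // DataCite XML
	"application/vnd.citationstyles.csl+json"  // CSL JSON
];

// Build an Accept header listing the formats with decreasing q-values.
function buildAcceptHeader(formats) {
	return formats
		.map((fmt, i) => (i === 0 ? fmt : fmt + ";q=" + (1 - i * 0.1).toFixed(1)))
		.join(", ");
}

// Dispatch on the Content-Type the registration agency actually returned.
function pickParser(contentType) {
	if (contentType.includes("unixref")) return "parseUnixref";
	if (contentType.includes("datacite")) return "parseDataciteXML";
	return "parseCSLJSON";
}

console.log(buildAcceptHeader(PREFERRED_FORMATS));
console.log(pickParser("application/vnd.datacite.datacite+xml"));
```

Checking the returned Content-Type rather than guessing by agency keeps the logic in one place even if more registration agencies come online later.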
DOI Content Negotiations.js (outdated)

    item.genre = item['container-title']
    }
    if (item.type != "article-journal" && item.type != "paper-conference") {
        item.note = "DOI: " + item.DOI;
This should actually go into the extra field.
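A minimal sketch of that suggestion (the function name and exact type check are illustrative, not the merged code): append the DOI to the extra field rather than creating a note, for item types whose Zotero schema has no DOI field.

```javascript
// Sketch: move the DOI into "extra" instead of "note" for non-journal items.
// Assumes Zotero-style item objects; only journalArticle keeps a DOI field here.
function moveDOIToExtra(item) {
	if (item.itemType !== "journalArticle" && item.DOI) {
		item.extra = (item.extra ? item.extra + "\n" : "") + "DOI: " + item.DOI;
		delete item.DOI;
	}
	return item;
}

const book = moveDOIToExtra({ itemType: "book", DOI: "10.12763/ONA1045" });
console.log(book.extra); // → "DOI: 10.12763/ONA1045"
```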
@adam3smith Any update on this matter? Moreover, testing in 5.0 would be needed now... @adam3smith And just checking: is this change still what we want?
I definitely still want to do this, yes, and I'd be surprised if it didn't work in 5.0.
@adam3smith: Somewhat tangential, but just to understand this better, for the Zenodo example you gave in the forums, 10.5281/zenodo.889811, it seems the data from content negotiation is different from the metadata available on zenodo.org, even in the same format:
vs. https://zenodo.org/record/889811/export/csl
Seems like the former is preprint data that was never updated? Is that common? I was looking because I was curious how complete CSL-JSON data is compared to other formats. In this case that's not relevant, since they're both CSL-JSON, but are there fields we might care about that don't exist in CSL-JSON? (I didn't find any other formats that worked via CN for that DOI anyway, so perhaps irrelevant…)
I don't think this is what you see here. I guess that the difference comes from the DataCite metadata schema, which lacks some details that are common for journal articles, e.g. journal title, volume, issue, pages. If you start with less information to create a CSL-JSON at DataCite than at Zenodo, then it is understandable that the result is a different CSL-JSON.
Yes, @zuphilip has that exactly right -- this is caused by back-and-forth translation to/from an unsuitable schema. I've been talking to DataCite and they're very likely going to update their schema to better accommodate articles in the future.
You can see the full entry into the datacite catalog using
The obvious things we're not getting from CSL JSON are subject (tags) and license (rights).
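If we do end up consulting DataCite's own metadata alongside CSL JSON, recovering those two fields could look roughly like this. This is a sketch assuming DataCite-schema-style `subjects` and `rightsList` arrays; the exact field names would need to be confirmed against an actual DataCite response.

```javascript
// Sketch: copy subjects -> tags and rightsList -> rights from a
// DataCite-style record. The input shape is an assumption based on the
// DataCite metadata schema, not verified against the live API.
function addDataciteExtras(item, record) {
	item.tags = item.tags || [];
	for (const s of record.subjects || []) {
		if (s.subject) item.tags.push(s.subject);
	}
	if (Array.isArray(record.rightsList) && record.rightsList.length) {
		item.rights = record.rightsList.map(r => r.rights).filter(Boolean).join("; ");
	}
	return item;
}

const item = addDataciteExtras({}, {
	subjects: [{ subject: "history" }, { subject: "prints" }],
	rightsList: [{ rights: "Creative Commons Attribution 4.0" }]
});
console.log(item.tags);   // → ["history", "prints"]
console.log(item.rights); // → "Creative Commons Attribution 4.0"
```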
Reviewing with @zuphilip -- we're currently thinking that to get the best result, we'll want to
This will currently work for mEDRA, Japan Link Center, and KISTI. crosscite.org suggests it should also work for ISTIC, but that's currently not the case. An additional idea for the longer term would be to check whether the EU Publications Office (which doesn't support CN) has a very limited set of prefixes. If so, we could follow the URL and import from the landing page, which has OK metadata. Not implementing this isn't a showstopper -- we currently fail on OP DOIs, too.
For the call URL we can use something like
and for differentiation of the different JSON formats, DataCite includes
which CSL JSON does not.
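The distinguishing field got lost above; purely as an illustration (assuming a marker key such as `schemaVersion`, which would need to be confirmed against an actual DataCite response), the format check could be as small as:

```javascript
// Sketch: tell DataCite's own JSON apart from CSL JSON by a key that only
// the former carries. "schemaVersion" is an assumed marker, not confirmed.
function isDataciteJSON(json) {
	return Object.prototype.hasOwnProperty.call(json, "schemaVersion");
}

console.log(isDataciteJSON({ schemaVersion: "http://datacite.org/schema/kernel-4" })); // → true
console.log(isDataciteJSON({ type: "article-journal", DOI: "10.12763/ONA1045" }));     // → false
```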
Delete Datacite & Crossref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are some comments about the code to clean it up. Some easier things I suggested directly in my PR adam3smith#11 to your branch. In general this looks good. This PR now depends on #1812; we should either merge that first or merge it into this branch. I have only tested a few cases so far and will try to do some more, but they work fine.
DOI Content Negotiations.js (outdated)

    Z.debug(item)
    if (item.itemType == "report") {
        item.reportType = item.publicationTitle;
    }
Why do we do these fixes here? I would rather suggest doing these fixes in the CSL JSON translator... Ideally, these three cases would differ only in which translator we choose.
DOI Content Negotiations.js (outdated)

    "url": "http://dx.doi.org/10.12763/ONA1045",
    "DOI": "10.12763/ONA1045",
    "date": "1734",
    "libraryCatalog": "DataCite"
Update the test here.
You do this manually, right?
E.g. 10.1023/A:1005640604267 or 10.4086/cjtcs.2012.002
Okay, I tested with 140 CrossRef DOIs and 13 DataCite DOIs found in My Library. All DataCite DOIs are working, but two CrossRef DOIs failed because the title was empty (fixed in PR), and two other DOIs failed where I don't know why exactly:
Update and simplify
@zuphilip SICIs (your failing examples, e.g. 10.1002/(SICI)1097-461X(1999)72:4<287::AID-QUA11>3.0.CO;2-Y) are a pain to work with and break many things. Luckily, they're a child of the 1990s and no new ones are being generated (Crossref stopped supporting them at some point, I think).
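For anyone wondering why SICIs break things: they contain characters (`<`, `>`, `:`, `;`) that are illegal in a raw URL path, so any code building request URLs from DOIs has to encode them. A minimal illustration:

```javascript
// SICI DOIs contain <, >, :, ; and parentheses; everything except the
// parentheses must be percent-encoded before the DOI can go into a URL.
const sici = "10.1002/(SICI)1097-461X(1999)72:4<287::AID-QUA11>3.0.CO;2-Y";
const encoded = encodeURIComponent(sici);
console.log(encoded);
// → "10.1002%2F(SICI)1097-461X(1999)72%3A4%3C287%3A%3AAID-QUA11%3E3.0.CO%3B2-Y"
```

Any code path that concatenates an unencoded SICI into a URL, an XPath, or an HTML attribute is a candidate for breakage.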
Update tests
This is working nicely for me now & a clear improvement over current behavior. It's a major change, though, so it would be good to get another set of eyes on this before merging.
But if I stash, I cannot lint the translator. Or do you mean to stash on the branch and then pop on master?
Yes, some version of that.
When I run the current state of this DOI translator, the tests time out without completing.
You included the other translators in the translator directory? Pretty sure I tested this last night.
Ah -- no, I did not.
The deletions are only relevant if you're going to test add-by-identifier, but you'll need all added & changed files for the DOI.js tests.
I think I have the PR-on-PR submitted.
Yup! I'll look more closely tonight, but generally it looks great, thanks. I have one question right away -- this was a part of the code that I didn't quite understand, and I think your code does something slightly different -- which may be fine, but I want to understand what we were thinking originally.
I don't know -- my guess was that at some point, when there was just one, the selector wasn't shown (edit: this was indeed the case, but it was removed in 569df5d while the check stayed in place). But I didn't want to change any more than required to pacify the linter.
When I go to https://www.ncbi.nlm.nih.gov/pubmed/?term=Recent+Trends+in+Advanced+Contact+Lenses.+Advanced+Healthcare+Materials and fetch using the DOI translator, I don't get the abstract, but if I follow the DOI, I end up at https://onlinelibrary.wiley.com/doi/full/10.1002/adhm.201801390, and when I fetch there, I do get an abstract.
The two access entirely different things: when you use the DOI translator, it requests metadata for the DOI via content negotiation (i.e. specifying the metadata format in the Accept header of the request), and the metadata is then provided by Crossref (or another DOI registration agency, but mostly Crossref for articles). Very few articles on Crossref include abstracts in the metadata. When you go straight to Wiley, you use the dedicated translator for Wiley, which uses the metadata provided by Wiley in its export functions (and may also try to scrape the abstract from the page).

But of course, prior to following the DOI, you don't know which publisher/site you'll land on, so you can't specify the translator. Because of that, within the current translator framework, getting the abstract isn't possible. I think it'd be conceptually doable to follow the DOI resolution (i.e. in this case to Wiley), see if we have a translator for that page, and if so, import using that translator (and otherwise fall back to using DOI metadata), but whether and how to do that is a more involved discussion, probably better for zotero-dev. Among its downsides is that it's much slower, since you're now making multiple calls that require full page loads rather than one lightweight call to an API.
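The "lightweight call" in question is just a single HTTP GET against doi.org with the desired metadata format in the Accept header. As a sketch of the request shape only (the real translator performs the fetch through ZU.doGet):

```javascript
// Sketch of the content-negotiation request: one GET against doi.org,
// metadata format selected via the Accept header. No page load, no
// publisher site involved.
function buildCNRequest(doi) {
	return {
		method: "GET",
		url: "https://doi.org/" + encodeURIComponent(doi),
		headers: { Accept: "application/vnd.citationstyles.csl+json" }
	};
}

const req = buildCNRequest("10.1002/adhm.201801390");
console.log(req.url); // → "https://doi.org/10.1002%2Fadhm.201801390"
```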
treewalker is substantially faster than xpath
OK, the only linting errors remaining are the
It's possible to lift the handlers out of the loop, but I feel it'd be less readable. The other possibility I thought might work was using Promise.all, but I tested it and couldn't get it to work.
Yeah, agreed on not moving the handlers out of the loop; that makes no sense conceptually.
Yeah, that's fine.
Alright! I think this is ready to merge. Will do so tomorrow PM unless I hear any objections. Thanks everyone for help & advice!
Just as a general point of interest -- can someone tell me why this promisified version doesn't work? The DOI scraping works, of course, but in Scaffold I get
Web translators don't properly support promises yet. We'll have proper async/await support in the near future. |
Related: zotero/zotero#1609 |
Sorry, I think I said something to the contrary above, having remembered Dan's work on import translators.
Alright, here we go! Thanks everyone! |