Web query returns multiple results unexpectedly #65

Closed
dhimmel opened this issue Dec 18, 2018 · 7 comments

Comments

@dhimmel
Contributor

dhimmel commented Dec 18, 2018

The following query:

curl --silent \
  --data 'https://zietzm.github.io/Vagelos2017/' \
  --header 'Content-Type: text/plain' \
  https://translate.manubot.org/web

Returns multiple results:

{
    "url": "https://zietzm.github.io/Vagelos2017/",
    "session": "kSPz1b6essGWbyC",
    "items": {
        "10.1038/nbt.2786": "Clinical development success rates for investigational drugs",
        "10.1038/534314a": "Can you teach old drugs new tricks?",
        "10.1016/j.jhealeco.2016.01.012": "Innovation in the pharmaceutical industry: New estimates of R&D costs",
        "10.1038/nrd3681": "Diagnosing the decline in pharmaceutical R&D efficiency",
        "10.1016/S0167-6296(02)00126-1": "The price of innovation: new estimates of drug development costs",
        "10.1038/nrd3405": "The productivity crisis in pharmaceutical R&D",
        "10.1016/0167-6296(91)90001-4": "Cost of innovation in the pharmaceutical industry",
        "10.1021/acs.jcim.7b00028": "DeepPPI: Boosting Prediction of Protein\u2013Protein Interactions with Deep Neural Networks",
        "10.1371/journal.pcbi.1004259": "Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes"
    }
}

It looks like https://zietzm.github.io/Vagelos2017/ is being interpreted as a page of search results rather than a citeable work. Interestingly, translation-server does return metadata for https://greenelab.github.io/meta-review/.

So what causes a web URL to be interpreted as containing multiple choices? For our use case, we never want multiple choices. Is there an option to disable multiple choices, such that every web query is considered a single citeable work?

@adomasven
Member

If you visit a page with the Zotero Connector, right-click the Connector button, and open the "Save to Zotero" menu, you will see the translators that Zotero has detected for the page, listed in priority order based on the metadata available on the page and the translators applicable to it. The first translator for https://zietzm.github.io/Vagelos2017/ is DOI, which means the best-quality data Zotero can get comes from the DOI URLs on the page. The DOI translator always produces a "multiple" choice, even when there is only a single DOI on the page. That is because a page sometimes contains DOIs that link to other items but none for the item you are looking at, in which case you might want to cancel the translation and save the page as a webpage instead. Saving as a webpage usually yields the poorest metadata of any translator, but that is still better than saving via the DOI translator and getting an item completely unrelated to what you were looking at. The translation server always uses the highest-priority translator, which is why you are getting the multiple-choice response.

The true option here would be for these pages to expose metadata in an unambiguous format for Zotero, but you can also force the translation server to skip "multiple" translators by using the query parameter ?single=1. Note that this will likely yield poorer metadata for pages where only a multiple option is available.
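As a minimal sketch of the ?single=1 option, the following Python builds the same request as the curl command in this thread. The helper names (build_web_endpoint, build_web_request) are made up for illustration and are not part of translation-server or Manubot; only the /web path, the text/plain body, and the single=1 parameter come from the thread. Sending the request and decoding the response are deliberately left out.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public instance used elsewhere in this thread.
DEFAULT_BASE_URL = "https://translate.manubot.org"


def build_web_endpoint(base_url, single=False):
    """Return the /web endpoint, adding ?single=1 only when requested."""
    endpoint = base_url.rstrip("/") + "/web"
    if single:
        # The parameter is omitted entirely otherwise, because a comment in
        # this thread reports that ?single=0 behaved like ?single=1.
        endpoint += "?" + urlencode({"single": 1})
    return endpoint


def build_web_request(page_url, base_url=DEFAULT_BASE_URL, single=False):
    """Build (but do not send) a POST with the page URL as a text/plain body."""
    return Request(
        build_web_endpoint(base_url, single),
        data=page_url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
```

Keeping URL construction separate from the network call makes it easy to check that a client really sends single=1 before pointing it at a live server.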

@dhimmel
Contributor Author

dhimmel commented Dec 18, 2018

Thanks @adomasven for the helpful explanation. I see that https://greenelab.github.io/meta-review/ saves to Zotero using "Embedded Metadata", but https://zietzm.github.io/Vagelos2017/ uses DOI as top-priority.

Confirming that the following, which sets single=1, returns a single record of metadata:

curl --silent \
  --data 'https://zietzm.github.io/Vagelos2017/' \
  --header 'Content-Type: text/plain' \
  'https://translate.manubot.org/web?single=1'

Although it seems to me that ?single=0 also has the same effect as ?single=1.

> Note that this will likely yield poorer metadata for pages where only a multiple option is available.

I can't seem to find any webpages with a single DOI where Zotero Connector selects the DOI translator as top priority (instead it selects Embedded Metadata). I am curious whether translation server still returns a top level JSON object (i.e. multiple records) or array (i.e. single record) when a single DOI item is found.

I am wondering whether we should first make our web query without single=1, but then fall back to single=1 if multiple results are returned. Or should we just always specify single=1? How common are translators that return multiple results?

> The true option here would be for these pages to expose metadata in an unambiguous format for Zotero

Certainly. We use pandoc to output our HTML webpages and will look into making sure the best metadata is getting set for citations. But that is a separate issue, since most web pages are not under our control.

@dstillman
Member

> I can't seem to find any webpages with a single DOI where Zotero Connector selects the DOI translator as top priority (instead it selects Embedded Metadata).

It won't. Priority is at the level of translators, and EM has a higher priority than DOI. But we're currently working on a combined translator that will intelligently choose between the different generic formats available.

> I am curious whether translation server still returns a top level JSON object (i.e. multiple records) or array (i.e. single record) when a single DOI item is found.

Multiple.

> I am wondering whether we should first make our web query without single=1, but then fall back to single=1 if multiple results are returned.

Any page with just DOIs and no embedded metadata will return multiple, so if you want to support those properly at the moment, you should display a selection interface. single=1 is really for cases where there's no way to present an option to the user, and if you use it whenever there's a single result, you'll get much worse data than necessary when there's a single DOI that does match the page.

In the combined generic translator, we'll probably resolve the first DOI result and try to determine if it matches the current page, and if so just return that automatically. (We'd probably need a flag to force it to multiple so that people could still get the selection window in the client when choosing the DOI translator from the context menu.)

Your example page is tricky, though, because the DOI doesn't resolve, so single=1 is in fact the only way to save that page. (In the client, the equivalent is choosing the webpage option from the context menu.) There's not really a great solution for that now, but one option would be a separate button in the selection interface to save as a webpage, which would run again with single=1. Looking ahead, the combined translator should give better multiple options that always include some version of the current page.
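For a client that has to decide whether to show a selection interface, the two response shapes mentioned in this thread can be told apart after JSON decoding. This is only a sketch: the function names are hypothetical, and the shapes are taken from the discussion above (a top-level object with an "items" mapping for multiple choices, a top-level array for a single record) rather than from any formal specification.

```python
def classify_web_response(parsed):
    """Classify a decoded /web response as 'single' or 'multiple'.

    Assumption (from this thread): a top-level JSON array means a single
    record, and a top-level object with an "items" mapping means the
    translator returned multiple choices.
    """
    if isinstance(parsed, list):
        return "single"
    if isinstance(parsed, dict) and isinstance(parsed.get("items"), dict):
        return "multiple"
    raise ValueError("unrecognized translation-server response shape")


def list_choices(parsed):
    """Return (identifier, title) pairs to show in a selection interface."""
    if classify_web_response(parsed) != "multiple":
        raise ValueError("response does not contain multiple choices")
    return list(parsed["items"].items())
```

Note that, as stated above, a page with exactly one DOI still produces the "multiple" shape, so the number of choices has to be checked separately.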

@dhimmel
Contributor Author

dhimmel commented Dec 19, 2018

> Your example page is tricky, though, because the DOI doesn't resolve

For the record, the situation with https://doi.org/10.6084/m9.figshare.5346577 is slightly more complex. The DOI is registered with DataCite and does resolve. However, currently DataCite's DOI Content Negotiation service is down (see manubot/manubot#89).

However, even if the DOI metadata retrieval did work, I don't actually think we'd want to return that metadata. If the user did want the DOI metadata, they could cite the DOI directly. However, this is specific to our use case, where translation-server is called in the backend to get metadata for an in-text citation string. Returning a single result is required here, so I've opened a PR at manubot/manubot#90 to do that!

It seems like most of the downsides to single=1 don't apply to the types of URLs Manubot users would cite, which should be single works that likely don't have DOIs and hence are cited by URL.

I'll keep an eye on the generic translator progress in zotero/translators#1092.

@dstillman
Member

> However, even if the DOI metadata retrieval did work, I don't actually think we'd want to return that metadata. If the user did want the DOI metadata, they could cite the DOI directly.

I'm not sure what you mean by that. The data available from the publisher page and the data registered for the DOI should be equivalent. They're sometimes not, but that's usually just because one or the other isn't up to date — the understanding in Zotero is that you should get more or less the same results regardless of whether you add an item by its DOI (using Add Item by Identifier, which is equivalent to /search) or by going to its webpage and clicking the save button (equivalent to /web). Among other things, the DOI metadata should include the publisher URL, and the data from the publisher's page should include the DOI.

> It seems like most of the downsides to single=1 don't apply to the types of URLs Manubot users would cite, which should be single works that likely don't have DOIs and hence are cited by URL.

Why would cited works likely not have DOIs? In any case, because the DOI translator has the lowest priority of any translator, single=1 forces the use of the most bare-bones data: title, URL, and access date. If there's a chance a DOI could be used, you absolutely want to use that instead.

Again, the way Zotero translators work, a user shouldn't really need to distinguish between a URL and a DOI. You don't "cite a URL" or "cite a DOI" in Zotero. They're just two identifiers for the same thing. There are currently some practical differences in terms of what you might get depending on what you use and the exact source of the data, but there's no particular pattern, and we're constantly trying to reduce those differences.

@dhimmel
Contributor Author

dhimmel commented Dec 20, 2018

Sorry @dstillman that I wasn't super clear about our use case for translation-server. We're using translation-server to retrieve metadata for persistent identifiers, which is an application that is not related to Zotero. Our application is much more similar to https://zbib.org than the Zotero Connector.

Specifically, a Manubot user will write a sentence like:

Here is a sentence with several citations [
  @doi:10.15363/thinklab.4;
  @pmid:26158728;
  @arxiv:1508.06576;
  @isbn:9780394603988;
  @wikidata:Q50051684;
  @url:https://nyti.ms/1QUgAt1
].

The manubot software will then retrieve metadata for all of those citations and create the bibliography according to the user-specified CSL style. Currently, we use translation-server to retrieve metadata for the ISBN, Wikidata, and URL citations. Our usage guide recommends citing DOIs over URLs, such that when one cites a URL, we can assume there is not a suitable DOI substitute.
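To make the citation format concrete, here is a rough sketch of splitting such citation strings into a source prefix and an identifier, and flagging the three sources that this comment says go through translation-server. It is not Manubot's actual parser; the function names and the set constant are invented for this example.

```python
# Sources that, per the comment above, are currently resolved through
# translation-server. The constant name is hypothetical.
TRANSLATION_SERVER_SOURCES = {"isbn", "wikidata", "url"}


def parse_citation(citation):
    """Split '@source:identifier' into a (source, identifier) tuple."""
    if not citation.startswith("@"):
        raise ValueError(f"citation must start with '@': {citation!r}")
    # Split on the first colon only, so URLs keep their own colons intact.
    source, separator, identifier = citation[1:].partition(":")
    if not separator or not source or not identifier:
        raise ValueError(f"expected '@source:identifier': {citation!r}")
    return source.lower(), identifier


def uses_translation_server(citation):
    """Report whether a citation's source is one of the three listed above."""
    source, _ = parse_citation(citation)
    return source in TRANSLATION_SERVER_SOURCES
```

For example, parse_citation("@url:https://nyti.ms/1QUgAt1") yields ("url", "https://nyti.ms/1QUgAt1"), and only that kind of citation would reach the /web query discussed in this issue.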

Since our use case assumes each citation points to a singular citeable work, we need to make sure translation-server returns a single result. One option would be to allow multiple records iff the length is 1. If length > 1, fall back to single=1. However, that is more complicated and I haven't seen any examples or URLs where single=1 does not do what we want.

Does that explanation help clear things up?
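The more complicated alternative mentioned above (accept a multiple response only when it lists exactly one item, otherwise retry with single=1) could look roughly like the sketch below. Both callables are hypothetical stand-ins: query(page_url, single) is assumed to return the decoded /web response, and select(response) is assumed to turn a one-item multiple response into a full record. How that selection is actually performed against translation-server is outside this sketch.

```python
def resolve_single_work(page_url, query, select):
    """Return one metadata record for page_url, or None if nothing is found.

    query and select are caller-supplied callables (assumptions of this
    sketch, not translation-server or Manubot APIs).
    """
    first = query(page_url, single=False)
    if isinstance(first, list):
        # Already a single-record response.
        return first[0] if first else None
    items = first.get("items", {})
    if len(items) == 1:
        # Exactly one choice: accept it instead of degrading to single=1.
        return select(first)
    # Zero or several choices: fall back to translators that return one result.
    retry = query(page_url, single=True)
    return retry[0] if retry else None
```

The cost of this approach is a second request for every ambiguous page, which is part of why always sending single=1 is the simpler choice.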

@dstillman
Member

Right, I understand all that. I'm just saying that both DOIs and publisher URLs can be considered persistent identifiers, and Zotero translators will ideally get high-quality metadata from either, so while it's fine to recommend using DOIs, it doesn't need to be a strict requirement if you don't want to trouble people with the distinction. And for what it's worth, we choose not to: you mention ZoteroBib, but there, just as in Zotero, you can add items via DOI or URL to cite the same thing (and if there's ever ambiguity, as with DOIs from the page, we show a selection interface). Most DOI metadata is retrieved via APIs, which is an advantage, but there are times when the DOI metadata is outdated and the publisher site offers better data. (We're hoping to move in a direction where you get the best possible metadata wherever you save from.)

> However, that is more complicated and I haven't seen any examples or URLs where single=1 does not do what we want.

Well, to go back to your original example, with single=1 you get fairly useless metadata: title, accessDate, and URL, and you'll never get more than that. If the DOI were resolving properly, you'd get much better metadata if it used the DOI — either by letting the user confirm the selection, as might be necessary now, or automatically, if a combined generic translator checked the resolved DOI metadata against the page and determined that it matched the document.

In any case, you can certainly use single=1, particularly if you're advising people to always prefer DOIs. Just understand that there can be cases where it results in worse metadata than could be retrieved from information on the page.

dhimmel added a commit to manubot/manubot that referenced this issue Dec 31, 2018
Merges #90

translation-server web query: specify single=1 to use only translators
that return a single result. Currently, this disables the DOI web translator.
Refs zotero/translation-server#65

Remove child notes in Zotero data. Keep only first item of a Zotero
data list (JSON array). See _passthrough_zotero_data.
Refs zotero/translation-server#67

Improve export_as_csl error handling. When an error status code is returned,
provide a more specific exception rather than JSON parsing failed.

translation-server test coverage improvements:
* test that web_query returns single result
* test ISBN which previously failed due to containing child notes 
* test_zotero: test all ID types for /search
* get_url_citeproc_zotero: test manubot & GitHub URLs
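The "keep only first item" and "remove child notes" steps in this commit message can be illustrated with a short sketch. This is not the code of _passthrough_zotero_data. It assumes, without confirmation from the thread, that notes may appear either as separate list entries with itemType "note" or under a "notes" key on a record, and the function name is invented.

```python
def keep_first_record(zotero_data):
    """Return the first bibliographic record from a Zotero JSON array.

    Assumptions of this sketch: note entries are marked with
    itemType == "note", and child notes may also sit under a "notes" key.
    """
    records = [item for item in zotero_data if item.get("itemType") != "note"]
    if not records:
        raise ValueError("no bibliographic record in Zotero data")
    record = dict(records[0])  # shallow copy so the caller's data is untouched
    record.pop("notes", None)
    return record
```

Returning a copy rather than mutating the input keeps the original server response available for debugging.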