Extract DOI from the current web page URL #1799

mrtcode · 2018-12-14T20:08:55Z

From some URLs, the DOI can't be extracted correctly, e.g. http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues would result in 10.1111/%28ISSN%291470-9856/issues instead of 10.1111/%28ISSN%291470-9856.

Some URLs can also have multiple DOIs, e.g. http://api.crossref.org/works/?filter=doi:10.1117/3.1002595.ch10,doi:10.3403/00522251u,doi:10.3403/00522251,doi:10.3403/30217493,doi:10.3403/30289582,doi:10.1117/12.939903,doi:10.3403/02454346u,doi:10.1364/ofc.1979.thf1,doi:10.5772/7558,doi:10.3758/BF03202760,doi:10.3758/bf03195760,doi:10.1006/jmla.1997.2532,doi:10.1037/h0082866.

dstillman · 2018-12-14T20:19:28Z

Shouldn't we return document instead of multiple when there's a single DOI? Even if it's invalid or not found, there's still no reason to show the Select Items dialog. (That would only make sense if it could actually result in a different item from the one you were expecting.)

mrtcode · 2018-12-14T20:26:27Z

Do we want to change this behavior for DOIs that are scraped from the document too? Why was it set to return multiple, then?

dstillman · 2018-12-14T20:30:23Z

No. In the page we don't know if it describes the main item for the page. In the URL we do.

mrtcode · 2018-12-17T10:42:06Z

If a random string that looks like a DOI is extracted from a URL, it would prevent further DOI extraction from the document. That means that if we want to return a single item, we first have to resolve the DOI metadata to make sure it's valid and, if it isn't, extract and resolve items from the document. The current commit just puts all DOIs into one list, where all items are resolved and the user just needs to select the one they think is correct. To resolve DOIs extracted from the URL and from the document separately, we would probably need serious modifications to the translator.

dstillman · 2018-12-17T10:53:19Z

DOI.js

+	var dois = [], m;
+
+	// Extract DOIs from the current URL
+	var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;


We usually use re for this, not rx.

Also, seems like we don't need to exclude quotes in this context.

And maybe we do need to URL decode? I'm not sure whether a URL that passes through here is necessarily decoded.
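
For illustration, a minimal sketch of URL extraction with the points above applied: the variable is named re, quotes are not excluded, and the URL is decoded first. The function name extractDOIsFromURL is hypothetical and this is not the code in DOI.js. It also escapes the dot after "10" and falls back to the raw URL when decoding fails, both of which are assumptions rather than anything settled in this review.

```javascript
// Hypothetical helper, not the translator's actual code:
// pull DOI-like strings out of a page URL.
function extractDOIsFromURL(url) {
	let decoded;
	try {
		decoded = decodeURIComponent(url);
	}
	catch (e) {
		// Malformed percent-escapes throw URIError; fall back to the raw URL
		decoded = url;
	}
	// "10." + registrant code of 4 or more digits + "/" + a suffix that stops at
	// whitespace or URL delimiters and does not end in "/", "." or ","
	const re = /10\.[0-9]{4,}\/[^\s&?#,]*[^\s&?#\/.,]/g;
	return decoded.match(re) || [];
}
```

Note that trailing path segments such as /issues are still captured, since nothing in the URL marks where the DOI ends.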

mrtcode · 2019-01-10T13:54:32Z

So as I said previously, it's not that rare to encounter web page URLs that contain a DOI, but we can't extract it reliably. For example:

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402

And we can't do anything about this.

A DOI can be extracted incorrectly from a URL, and this applies not only to the web page URL but also to URLs found in the body.

For DOI(s) extracted from a web page URL there are a few possible outcomes:

It's correct and results in correct metadata
It's incorrect and results in no metadata
It's correct but results in no metadata (DOI RAs don't have it, e.g. JSTOR)
It's incorrect and results in incorrect metadata - not possible, I would say, except that in some rare cases it may result in the actual journal instead of the article

So the translator should work like this, depending on where and how many DOIs were found:

one in URL - single
one in URL, one in body and they are equal - single
one in body - multiple
in all other cases - multiple
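
The rules above could be sketched roughly as follows. The function name chooseResultType and the "single"/"multiple" return strings are hypothetical labels taken from the list, not the translator's real return values. Returning null when no DOI is found at all, and comparing DOIs case-insensitively, are assumptions that the list does not spell out.

```javascript
// Hypothetical helper: decide between one item and the Select Items dialog,
// given the DOIs found in the page URL and in the page body.
function chooseResultType(urlDOIs, bodyDOIs) {
	// Assumption: nothing found anywhere means there is nothing to offer
	if (urlDOIs.length + bodyDOIs.length === 0) return null;
	// one in URL - single
	if (urlDOIs.length === 1 && bodyDOIs.length === 0) return "single";
	// one in URL, one in body and they are equal - single
	// (DOI names are case-insensitive, so compare lowercased)
	if (urlDOIs.length === 1 && bodyDOIs.length === 1
			&& urlDOIs[0].toLowerCase() === bodyDOIs[0].toLowerCase()) {
		return "single";
	}
	// one in body only, and all other cases - multiple
	return "multiple";
}
```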

dstillman · 2019-01-11T10:11:06Z

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

Is this what you meant? These are the same.

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402

And we can't do anything about this.

We could try stripping some likely suffixes, like /full and /pdf.

adam3smith · 2019-01-11T14:30:48Z

We could try stripping some likely suffixes, like /full and /pdf

Given the most common academic CMSs:
/full$ and /pdf$ (as per the above examples)
/abstract$ and /abs$ (speculating about these; I've mainly seen them before the DOI, where they're no problem)

(edit: removed the ones already covered by the regex)
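
A minimal sketch of that suffix stripping, assuming only the four suffixes named here. The names LIKELY_SUFFIXES and stripLikelySuffix are hypothetical. One caveat is that a real DOI could legitimately end in one of these segments, so this is a heuristic, not a guarantee.

```javascript
// Hypothetical list, limited to the suffixes mentioned in this thread
const LIKELY_SUFFIXES = /\/(full|pdf|abstract|abs)$/;

// Remove one trailing CMS path segment from a DOI extracted from a URL
function stripLikelySuffix(doi) {
	return doi.replace(LIKELY_SUFFIXES, "");
}
```

With this, 10.3389/fmicb.2014.00402/full becomes 10.3389/fmicb.2014.00402, while 10.1088/0004-637X/768/1/87 is left unchanged.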

mrtcode · 2019-01-11T18:31:14Z

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

Is this what you meant? These are the same.

Yeah, I just wanted to demonstrate the difference between the two URLs.

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402
And we can't do anything about this.

We could try stripping some likely suffixes, like /full and /pdf.

This idea is worth investigating, but there can be many variants. More examples:
http://www.oxfordreference.com/view/10.1093/acref/9780199608218.001.0001/acref-9780199608218
http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues
http://iopscience.iop.org/article/10.1088/0022-3727/34/10/311/meta

zuphilip · 2019-01-13T14:48:39Z

one in URL, one in body and they are equal - single

I think here you can also test whether the one in the body is the beginning part of the one in the URL. This would IMO be more stable and would do the same (at least for all the examples here) as deleting /full or similar suffixes. If a publisher builds its URLs from the DOIs, then I'd expect it to also mention this DOI in the body text. Therefore I would suggest doing this instead of, not in addition to, deleting anything at the end.
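
That "beginning part" test might look like the sketch below. The function name bodyDOIMatchesURL is hypothetical. Requiring a "/" right after the body DOI is an added assumption, so that a body DOI ending in "12" does not match a URL DOI ending in "123"; it holds for the /full, /meta and /issues examples in this thread.

```javascript
// Hypothetical predicate: is the DOI found in the body equal to the DOI
// extracted from the URL, or a leading part of it?
function bodyDOIMatchesURL(urlDOI, bodyDOI) {
	// DOI names are case-insensitive
	const u = urlDOI.toLowerCase();
	const b = bodyDOI.toLowerCase();
	// Assumption: extra URL text must start at a "/" boundary
	return u === b || u.startsWith(b + "/");
}
```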

in all other cases - multiple

How about the case: There are multiple DOIs in the body, one in the URL which matches the first one in the body. Should we then just go for this one, i.e. a single case, e.g. https://olh.openlibhums.org/article/10.16995/olh.46/ (there is another DOI in the reference section at the bottom). The same would be true for the Frontiers paper. There is the drawback that then it is not possible to save all references with DOI instead of the article from pages like https://www.frontiersin.org/articles/10.3389/fmicb.2014.00402/full#h12 . But is this really a "feature" we need? (Maybe a hidden preference to toggle it on/off would be enough.) CC @adam3smith

zuphilip · 2019-01-13T14:54:54Z

DOI.js

+	{
+		"type": "web",
+		"url": "https://zotero.org/?d=10.7208/chicago/9780226924632.001.0001",
+		"items": "multiple"
 	}


Can you also add https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1099-1360(199711)6:6%3C320::AID-MCDA164%3E3.0.CO;2-2 as a test case?

dstillman · 2019-01-13T22:24:34Z

There are multiple DOIs in the body, one in the URL which matches the first one in the body. Should we then just go for this one

Yes.

There is the drawback that then it is not possible to save all references with DOI instead of the article

In #1092 (comment) I suggested a runtime flag:

With a combined translator and auto DOI matching, we'll also need a translation flag to force use of Select Items for DOIs even when one matches the page, so that we can still offer a DOI option in the context menu.

So it would be something like forceMultiple: true that got passed to translate() and was available from the translator (in configOptions?).

mrtcode · 2019-01-14T12:29:51Z

one in URL, one in body and they are equal - single

I think here you can also test whether the one in the body is the beginning part of the one in the URL. This would IMO be more stable and would do the same (at least for all the examples here) as deleting /full or similar suffixes. If a publisher builds its URLs from the DOIs, then I'd expect it to also mention this DOI in the body text. Therefore I would suggest doing this instead of, not in addition to, deleting anything at the end.

Yeah, the partial DOI matching sounds like a good idea. We just have to keep in mind that sometimes the journal itself can have a DOI that is part of the journal article DOI, but if there is only one DOI in the body, that would be very unlikely.

There are multiple DOIs in the body, one in the URL which matches the first one in the body. Should we then just go for this one

Yes.

DOI position in the body is not a very reliable metric. Some more advanced websites list metadata for additional articles:

Recommended/related articles
Prev/next article
Cited articles

But generally, I agree that if a DOI from the URL can be reliably matched with a DOI in the body, whether there are one or more of them, we should return a single item.
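
One way this could be sketched, ignoring position in the body entirely: pick the body DOI that equals the URL DOI or is a leading part of it, and prefer the longest such match so that a journal-level DOI loses to the article DOI. The function name pickDOIForPage, the "/" boundary check and the longest-match rule are all assumptions for illustration, and the DOIs in the usage note are made up.

```javascript
// Hypothetical helper: return the body DOI that best matches the URL DOI,
// or null if none matches (in which case Select Items would be shown).
function pickDOIForPage(urlDOI, bodyDOIs) {
	const u = urlDOI.toLowerCase();
	let best = null;
	for (const doi of bodyDOIs) {
		const b = doi.toLowerCase();
		// Equal, or a leading part of the URL DOI ending at a "/" boundary
		const matches = u === b || u.startsWith(b + "/");
		// Prefer the longest match, e.g. article DOI over journal DOI
		if (matches && (best === null || doi.length > best.length)) {
			best = doi;
		}
	}
	return best;
}
```

For example, with a URL DOI of 10.9999/journal/article1/full and body DOIs 10.9999/related.1, 10.9999/journal and 10.9999/journal/article1, this returns 10.9999/journal/article1 regardless of the order in which the body DOIs appear.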

Extract DOI from the current web page URL

1b6deda

dstillman requested changes Dec 17, 2018

Fix regular expression, decode DOI

92f8713

mrtcode force-pushed the extract-doi-from-url branch from b1a50c0 to 92f8713 Compare December 17, 2018 11:14

mrtcode mentioned this pull request Jan 9, 2019

Try translating PDF URLs based on URL zotero/translation-server#70

Open

zuphilip reviewed Jan 13, 2019

zuphilip mentioned this pull request Jan 28, 2019

[don't merge] Switch DOI to DOI.org #1135

Merged
