This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

Warning message: no plugin for Crossref member 345 yet #163

Closed
low-decarie opened this issue May 31, 2018 · 19 comments

@low-decarie

low-decarie commented May 31, 2018

It does not seem to be possible to download full text from "International Journal of Systematic and Evolutionary Microbiology" though most articles are open access.

Warning message:
no plugin for Crossref member 345 yet

I guess similar to previously reported issue of

no plugin for Crossref member 8215 yet #117

example url for DOI: 10.1099/ijs.0.006767-0
main page:
http://ijs.microbiologyresearch.org/content/journal/ijsem/10.1099/ijs.0.006767-0
html
http://ijs.microbiologyresearch.org/content/journal/ijsem/10.1099/ijs.0.006767-0#tab2
i.e. adding "#tab2" to the URL that dx.doi.org resolves from the DOI gives the full-text HTML.

pdf:
http://www.microbiologyresearch.org/docserver/fulltext/ijsem/59/6/1508.pdf? [+++ user-specific bits]

@sckott
Contributor

sckott commented May 31, 2018

thanks for the report @low-decarie

will have a look into this

@sckott
Contributor

sckott commented Jun 1, 2018

Hmm, it's kind of messy 😮

So I think we need a pattern like

http://ijs.microbiologyresearch.org/deliver/fulltext/ijsem/<doi suffix dot separated>.zip/<doi suffix concatenated no spaces>.pdf

And then ijsem is also specific to the journal, as well as ijs in ijs.microbiologyresearch.org

so I think can be done but a bit messy
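
A minimal sketch of how that pattern could be filled in, written in Python to keep it self-contained. It assumes "dot separated" means the DOI suffix as written and "concatenated" means the same suffix with the dots removed; that reading is an interpretation, and `build_pdf_url` with its parameters is a hypothetical name, not part of fulltext or ftdoi.

```python
def build_pdf_url(doi, host_prefix="ijs", journal="ijsem"):
    """Fill in the URL pattern sketched in the comment above.

    Assumption: the 'dot separated' part is the raw DOI suffix and the
    'concatenated' part is that suffix with the dots stripped out.
    """
    _, _, suffix = doi.partition("/")
    if not suffix:
        raise ValueError(f"not a DOI with a suffix: {doi!r}")
    concatenated = suffix.replace(".", "")
    return (
        f"http://{host_prefix}.microbiologyresearch.org/deliver/fulltext/"
        f"{journal}/{suffix}.zip/{concatenated}.pdf"
    )
```

Both `host_prefix` and `journal` are parameters because, as noted above, they differ per journal.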

@low-decarie
Author

I was trying to get an alternative using phantomjs to download the full-text html page after javascript interpretation, but for some reason the html full-text section is still missing. Thanks for your efforts.

@sckott
Contributor

sckott commented Jun 1, 2018

okay - try these on the command line

curl https://ftdoi.org/api/doi/10.1099/ijsem.0.002809/ | jq .
curl https://ftdoi.org/api/doi/10.1099/mic.0.000664/ | jq .
curl https://ftdoi.org/api/doi/10.1099/jgv.0.001056/ | jq .
curl https://ftdoi.org/api/doi/10.1099/mgen.0.000182/ | jq .
curl https://ftdoi.org/api/doi/10.1099/jmmcr.0.005152/ | jq .
curl https://ftdoi.org/api/doi/10.1099/jmm.0.000647/ | jq .

and fulltext should hopefully work on these now, try

remotes::install_github("ropensci/fulltext")
ft_get('10.1099/ijsem.0.002809')

sckott added a commit that referenced this issue Jun 1, 2018
@low-decarie
Author

That is fantastic. Thank you!

It worked once, but I now get:

Warning message:
you may not have access to 10.1099/ijs.0.006387-0 or an error occurred

but if I check online, I do have access to 10.1099/ijs.0.006387-0 (and many others for which I get the same error); it's actually open access

@sckott
Contributor

sckott commented Jun 7, 2018

looks like because they use a different URL pattern for that DOI

http://www.microbiologyresearch.org/docserver/fulltext/ijsem/59/8/1919.pdf

whereas we were looking for https://github.com/ropenscilabs/pubpatterns/blob/master/src/microbiology.json#L23

@low-decarie
Author

Naive question, in the output from
curl https://ftdoi.org/api/doi/10.1099/ijsem.0.002809/ | jq .
that is being parsed by microbiology.json
is it not possible to have a very/more liberal/non-restrictive search for any url containing pdf?

@sckott
Contributor

sckott commented Jun 7, 2018

What do you mean by "any url containing pdf" ? ftdoi API works not by scraping publisher pages, but by using rules described in https://github.com/ropenscilabs/pubpatterns repo - so we only give back URLs from rules that we state ourselves.

or by "any url" did you mean URLs for other articles?
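
To illustrate the rule-based idea, here is a rough Python sketch, not the actual ftdoi or pubpatterns code. The rule table, its template string, and `pdf_url_from_rules` are all hypothetical; the only thing taken from this thread is the warning text shown when no rule exists for a Crossref member.

```python
import warnings

# Hypothetical rule table keyed by Crossref member ID.
# The template is a placeholder, not a real publisher pattern.
RULES = {
    345: "http://example.org/fulltext/{suffix}.pdf",
}

def pdf_url_from_rules(doi, member, rules=RULES):
    """Return a URL built from a hand-written rule, or None if no rule exists."""
    template = rules.get(member)
    if template is None:
        # Nothing is scraped or guessed: without a rule there is no URL.
        warnings.warn(f"no plugin for Crossref member {member} yet")
        return None
    suffix = doi.split("/", 1)[1]
    return template.format(suffix=suffix)
```

The point of the sketch is that only URLs derivable from a stated rule come back, so a publisher with several URL layouts needs several rules.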

@sckott
Contributor

sckott commented Jun 7, 2018

yeah, so publishers really suck. They have a different URL pattern for papers in press vs. papers assigned to a volume/issue/page numbers.

not sure yet how I'll deal with that.

@sckott
Contributor

sckott commented Jun 7, 2018

phew, okay i think this works now:

@low-decarie
Author

low-decarie commented Jun 8, 2018

I knew my comment was naive, the magic you are doing here escapes me. Thanks again for this great piece of work!

The issues I was having were all with articles that are already in a volume. Articles more than 6 months old become OA and those are the only articles to which I have access / want to download.

A PDF file is temporarily created in the cache folder, but it gets deleted when the command fails (I guess this is planned behaviour).

Here is a list of all IJSEM DOIs. There are three DOI formats:
10.1099/ijs.
10.1099/ijsem.
10.1099/00207713
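
As a small illustration, the three formats can be told apart by the start of the DOI suffix. This is a Python sketch with a hypothetical helper name (`ijsem_doi_format`) and made-up labels; it only covers the three prefixes listed above.

```python
def ijsem_doi_format(doi):
    """Classify an IJSEM DOI as 'ijs', 'ijsem', or 'numeric'; None if it matches none."""
    prefix, _, suffix = doi.strip().lower().partition("/")
    if prefix != "10.1099":
        return None
    if suffix.startswith("ijsem."):
        return "ijsem"
    if suffix.startswith("ijs."):
        return "ijs"
    if suffix.startswith("00207713"):
        return "numeric"
    return None
```

Grouping a DOI list by this label makes it easier to see which format a failing download belongs to.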

@sckott
Contributor

sckott commented Jun 8, 2018

A PDF file is temporarily created in the cache folder, but it gets deleted when the command fails (I guess this is planned behaviour).

yes, we don't want to cache a bad file so we clean it up (delete it) if something goes wrong.
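
A rough sketch of that clean-up behaviour, in Python rather than the package's R. It is not the fulltext implementation: `download_to_cache` and the injected `fetch` callable are hypothetical, and checking for the `%PDF-` file signature is just one assumed way to decide that a file is bad.

```python
import os

def download_to_cache(fetch, url, path):
    """Write fetch(url) to path and delete the file again if anything goes wrong.

    fetch is any callable that takes a URL and returns the response body as bytes.
    """
    try:
        with open(path, "wb") as fh:
            fh.write(fetch(url))
        # Assumed validity check: PDF files begin with the bytes "%PDF-".
        with open(path, "rb") as fh:
            if fh.read(5) != b"%PDF-":
                raise ValueError(f"{url} did not return a PDF")
    except Exception:
        # Never leave a partial or bad file behind in the cache.
        if os.path.exists(path):
            os.remove(path)
        raise
    return path
```

This is why a file can appear briefly in the cache folder and then vanish: it is written first and removed once the check fails.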

So does ft_get work then for the most part with your DOI list?

@sckott
Contributor

sckott commented Jun 11, 2018

@low-decarie does ft_get work then for the most part with your DOI list?

@low-decarie
Author

low-decarie commented Jun 12, 2018

If I do ft_get() on the whole list of DOIs, I get:
Error in names(z$data) <- tolower(names(z$data)) :
  attempt to set an attribute on NULL
In addition: Warning message:
404: Resource not found. - (10.1099/ijs.0-011122-0)

If I repeatedly sample 30 DOIs from this list and apply ft_get() to them, it fails for ~29 of the 30 (example warnings):

Warning messages:
1: you may not have access to 10.1099/ijs.0.64812-0 or an error occurred
2: you may not have access to 10.1099/ijs.0.000090 or an error occurred
3: you may not have access to 10.1099/ijs.0.020628-0 or an error occurred
4: you may not have access to 10.1099/00207713-51-3-731 or an error occurred
5: you may not have access to 10.1099/ijs.0.000125 or an error occurred
6: you may not have access to 10.1099/ijs.0.63769-0 or an error occurred
7: you may not have access to 10.1099/ijs.0.049106-0 or an error occurred
8: you may not have access to 10.1099/ijsem.0.001928 or an error occurred
9: you may not have access to 10.1099/ijs.0.65467-0 or an error occurred
10: you may not have access to 10.1099/ijsem.0.002131 or an error occurred
11: you may not have access to 10.1099/00207713-50-4-1655 or an error occurred
12: you may not have access to 10.1099/00207713-51-2-489 or an error occurred
13: you may not have access to 10.1099/ijs.0.068296-0 or an error occurred
14: you may not have access to 10.1099/ijs.0.038844-0 or an error occurred
15: you may not have access to 10.1099/ijs.0.053009-0 or an error occurred
16: you may not have access to 10.1099/ijs.0.022517-0 or an error occurred
17: you may not have access to 10.1099/ijs.0.023580-0 or an error occurred
18: you may not have access to 10.1099/ijs.0.064345-0 or an error occurred
19: you may not have access to 10.1099/ijs.0.02735-0 or an error occurred
20: you may not have access to 10.1099/ijsem.0.002212 or an error occurred
21: you may not have access to 10.1099/ijs.0.041178-0 or an error occurred
22: you may not have access to 10.1099/ijs.0.009258-0 or an error occurred
23: you may not have access to 10.1099/ijs.0.02505-0 or an error occurred
24: you may not have access to 10.1099/ijs.0.02377-0 or an error occurred
25: you may not have access to 10.1099/ijsem.0.001064 or an error occurred
26: you may not have access to 10.1099/ijs.0.056499-0 or an error occurred
27: you may not have access to 10.1099/ijs.0.001149-0 or an error occurred
28: you may not have access to 10.1099/ijs.0.000167 or an error occurred
29: you may not have access to 10.1099/ijsem.0.000979 or an error occurred

I have access to most through the browser (I don't have access to 10.1099/ijsem.0.002131 as it is less than 6 months old).

https://ftdoi.org/api/doi/10.1099/ijsem.0.001064/ gives a URL that actually works. I tried it again separately with ft_get('10.1099/ijsem.0.001064') and it worked. Same for 10.1099/ijsem.0.000979.

https://ftdoi.org/api/doi/10.1099/ijs.0.64812-0/
has faulty file link:
http://ijs.microbiologyresearch.org/deliver/fulltext/ijsem/57/7/1442_ijsem0.pdf
but the file is actually found at:
http://ijs.microbiologyresearch.org/deliver/fulltext/ijsem/57/7/1442.pdf

https://ftdoi.org/api/doi/10.1099/00207713-51-3-731
has faulty file link:
http://ijs.microbiologyresearch.org/deliver/fulltext/ijsem/51/3/731_ijsem731.pdf
but I can't identify a non-user-specific URL that works

...

@sckott
Contributor

sckott commented Jun 12, 2018

thanks for the details here @low-decarie - will have a look

i think the message i included, "you may not have access to DOI or an error occurred", doesn't necessarily mean you don't have access

@sckott
Contributor

sckott commented Jun 12, 2018

😢 oof, another exception to the rules i thought i figured out. so for now i changed the internals of ftdoi.org to just scrape the html landing page to get the pdf url, this does mean that requests for this publisher will be a bit slower unfortunately.

I tried with up to 50 DOIs and they all work now for me.

side note that the first DOI in the list I think had a typo, instead of 10.1099/ijs.0-011122-0 should be 10.1099/ijs.0.011122-0
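
For readers wondering what scraping the landing page for the PDF URL could look like, here is a minimal Python sketch. It is not the ftdoi.org code. It assumes the landing page exposes the PDF location in a `citation_pdf_url` meta tag, a common convention on publisher sites that this thread does not confirm for this publisher; `PdfLinkFinder` and `find_pdf_url` are hypothetical names.

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Remember the content of the first <meta name="citation_pdf_url"> tag."""

    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.pdf_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "citation_pdf_url":
            self.pdf_url = attributes.get("content")

def find_pdf_url(html):
    """Return the PDF URL advertised in the page's meta tags, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.pdf_url
```

Fetching the landing page first is an extra HTTP request per DOI, which matches the slowdown mentioned above.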

@low-decarie
Author

Fantastic! 🥇 ! Thank you!

@sckott sckott added this to the v1.1.0 milestone Jun 13, 2018
@sckott sckott closed this as completed Jun 13, 2018
@sckott
Contributor

sckott commented Jun 13, 2018

should speed up in the future: will do caching (sckott/pubpatternsapi#7) of the JSON response from the API, which is used inside fulltext

@sckott
Contributor

sckott commented Jun 13, 2018

caching added to the API - the 2nd and subsequent requests to the same route will be served from the cache for 24 hrs.
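
A minimal sketch of per-route caching with a 24-hour lifetime, in Python. It only illustrates the idea and is not how the API implements it; `TTLCache`, `get_or_compute`, and the injectable `clock` are hypothetical names.

```python
import time

class TTLCache:
    """Cache values per key and recompute them once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}    # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

With the route (for example `/api/doi/10.1099/ijsem.0.001064/`) as the key, only the first request in each 24-hour window would pay for the landing-page lookup.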
