fulltext sources #4
There are also two others. Both have OAI-PMH interfaces that I started to wrap. They may or may not give full text (I think they do), and we can get links to full text from the OAI-PMH service. These aren't worth making separate packages, so I'll fold them in here somehow, with the other OAI-PMH data sources.
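Here's a rough sketch of what harvesting over OAI-PMH could look like, using the `oai` package and PMC's OAI endpoint as a stand-in (not wired into fulltext yet; whatever sources we fold in would be harvested the same way):

```r
# install.packages("oai")
library(oai)

# PMC's OAI-PMH service as a stand-in endpoint; other OAI sources would work the same way
base <- "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"

# identify the repository
id(base)

# list some records as Dublin Core; records carry identifiers that can be
# followed through to full text
recs <- list_records(base, prefix = "oai_dc",
                     from = "2014-08-01", until = "2014-08-02")
head(recs)
```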
PeerJ: We should be able to get full text of PeerJ papers, but we won't have a way to search them programmatically as far as I know. Though maybe we could search via Crossref metadata search.
F1000Research: Should be able to get them through PubMed I think, e.g. http://www.ncbi.nlm.nih.gov/pubmed/25110583 - all F1000 papers are OA, right?
In terms of working out if an article/journal is available in PMC: NCBI has a table of journals and their level of participation. Here's the search for F1000Research (which is indeed all open access): http://www.ncbi.nlm.nih.gov/pmc/journals/?term=F1000Research&titles=all&search=journals I can't tell if they allow programmatic access to the search, but you can download a flat table (.csv) with ~2000 rows. Might be helpful in working out how to retrieve a full text copy of an article?
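If it helps, a quick sketch of poking at that downloadable table once you have it locally (the column names below are guesses and may need adjusting to match the actual file):

```r
jlist <- read.csv("jlist.csv", stringsAsFactors = FALSE)
names(jlist)

# e.g. find F1000Research and check its participation level
jlist[grepl("F1000Research", jlist$Journal.title, ignore.case = TRUE), ]
```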
Nice, thanks for that info! Wasn't aware of that table. And PeerJ is in there too. We can get DOIs by drilling down through this table, then, at least for PeerJ, use the number in the DOI (e.g., 228 in 10.7717/peerj.228) to construct the URL to the full text XML (e.g. https://peerj.com/articles/228.xml). @dwinter - is there a way in rentrez to search PMC by journal?
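Something like this, as a rough sketch (assuming PeerJ keeps this URL pattern):

```r
# go from a PeerJ DOI to the full text XML URL by pulling the article
# number out of the DOI suffix
peerj_xml_url <- function(doi) {
  num <- sub("^10\\.7717/peerj\\.", "", doi)
  paste0("https://peerj.com/articles/", num, ".xml")
}

peerj_xml_url("10.7717/peerj.228")
#> [1] "https://peerj.com/articles/228.xml"

# then e.g.: xml2::read_xml(peerj_xml_url("10.7717/peerj.228"))
```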
Yup, as ever you just have to look up how NCBI wants you to format the search term: http://www.ncbi.nlm.nih.gov/books/NBK3825/#pmchelp.Searching_by_journal_title
You can also use the doi to search on a per-article basis:
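Roughly like this with rentrez (the field tags follow that help page, but worth double-checking against what PMC actually accepts):

```r
library(rentrez)

# all PMC records for a journal
res_journal <- entrez_search(db = "pmc", term = '"F1000Research"[journal]')
res_journal$count

# a single article by DOI
res_doi <- entrez_search(db = "pmc", term = '"10.7717/peerj.228"[doi]')
res_doi$ids
```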
Nice, thanks @dwinter!
The master list, I'll update this as we get more info:
Via its Data for Research service, JSTOR provides full text in the form of word counts (i.e. one CSV or XML file per article; each file has two columns, one of words, one of counts of those words). You can only get the data after you've registered, submitted a request, and waited for the archive to be processed by the DfR servers and delivered as a zip file, which can take hours to days. So it's rather awkward to deal with programmatically, as @sckott notes. That said, I'm happy to work on my JSTORr package to harmonize it with this one, as much as that's possible.
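For what it's worth, once a DfR zip arrives and is unzipped, reading the per-article word count files is straightforward; a rough sketch (the directory layout and column names here are assumptions):

```r
files <- list.files("dfr_download/wordcounts", pattern = "\\.csv$", full.names = TRUE)

wordcounts <- do.call(rbind, lapply(files, function(f) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  names(x) <- c("word", "count")  # normalise whatever the headers actually are
  x$article <- basename(f)
  x
}))
head(wordcounts)
```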
There is also arXiv.
@benmarwick Thanks for the update. Okay, let's keep your pkg in mind and see how progress goes on this one; we can determine at a later date whether it's a good fit here.
We could also consider adding the global climate change information system. I've thought it would be good to have a package for this since I learned about it a couple months ago. The system is super robust and gives access to summaries of hundreds of climate change reports. I think full text could be accessed via links to PDFs, and I know that tm can handle PDFs pretty easily. While the whole API is probably one of the most robust I've ever seen, we could start off with a stripped-down version that gives access to PDF full text and findings summaries. Maybe this doesn't fit in because it's not traditional peer review.
@emhart Nice, I'll add it to the list.
@sckott I've been looking into how to do biorxiv. An alternative to the strategy you laid out is to do this with scraping. The basic search interface is actually really easy to implement and scrape: http://biorxiv.org/search/climate. There's an advanced search that would be a bit more complicated to reverse engineer, but it doesn't look terrible. This may be true for other sources. It raises 2 questions.
So it seems like integrating biorxiv could definitely be done without too much trouble.
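To give a flavour, something along these lines with rvest (the CSS selector below is a placeholder/guess and would need checking against the real markup):

```r
library(rvest)

page <- read_html("http://biorxiv.org/search/climate")

# pull links to individual article pages out of the results list
links <- page %>%
  html_nodes(".highwire-cite-linked-title a") %>%  # selector is a guess
  html_attr("href")

head(links)
```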
@sckott I created a new branch and added biorxiv search functionality. This effectively scrapes the webpages and grabs the URLs for papers from the search. Not sure of the best way to pull in the full text though. Currently we can grab abstracts easily, but full text is PDFs. We can parse the PDFs, but that requires some libraries on the user's end outside of R (http://poppler.freedesktop.org/). How do you handle this @kbroman in the arxiv pkg? Can you get full text in formats other than PDF? Or do you have some PDF parsing magic? Also @sckott do you think it's worth it to flesh out the search beyond the basic keyword? I'm not sure it will provide much advantage for us, as the current search searches all fields (full text, abstracts and authors). See http://www.biorxiv.org/search for a full list of possible search fields.
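For the poppler route, a sketch of what shelling out to pdftotext could look like (only works if poppler is installed on the user's system; just an idea, not a commitment to this approach):

```r
pdf_to_text <- function(pdf_path) {
  txt_path <- tempfile(fileext = ".txt")
  # poppler's command line tool: pdftotext input.pdf output.txt
  system2("pdftotext", args = c(shQuote(pdf_path), shQuote(txt_path)))
  paste(readLines(txt_path, warn = FALSE), collapse = "\n")
}

# txt <- pdf_to_text("some_paper.pdf")
```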
I haven't tried grabbing full text in the aRxiv package. I'm just taking what the API gives me. But the API does include the link to the PDF, so I don't have to scrape the abstract pages for that. |
Thanks @kbroman. I know we can read in PDFs via
@emhart we can start with the basic search for biorxiv, then modify if needed I think. I wonder if it would be worth allowing the user to optionally pass a PDF through another web service to extract text, then pull that down. Some folks may want to extract their own text, but others may not want to, or may not know how. e.g.,
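something along these lines with httr, say (the endpoint below is just a hypothetical placeholder, not a real service):

```r
library(httr)

extract_via_service <- function(pdf_path,
                                endpoint = "https://example.com/extract") {  # hypothetical endpoint
  res <- POST(endpoint, body = list(file = upload_file(pdf_path)))
  stop_for_status(res)
  content(res, as = "text", encoding = "UTF-8")
}

# txt <- extract_via_service("paper.pdf")
```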
I would suggest thinking through this as a modular approach. Downloading and passing PDFs through another service adds much more overhead than querying text off of APIs. IMO it would be a good idea to abstract out such heavy functionality into a helper package that people can call as needed. Another consideration is that these services won't be free, or if free will come with limited features, so that would require additional API keys (+ paid accounts) to work with.
yeah, modular is best
For PMC, we may want to import @cstubben's package https://github.com/cstubben/pmcXML; see a recent example on the mailing list: https://groups.google.com/d/msg/ropensci-discuss/T5aPR8e9RYc/GcK8-CwHMz4J
Scopus provides a full-text API for ScienceDirect with API keys (they have an interface where you can create your own). Would you consider including that?
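For reference, hitting the ScienceDirect article retrieval API with a key looks roughly like this, as far as I understand their docs (worth double checking before relying on it):

```r
library(httr)

key <- Sys.getenv("ELSEVIER_API_KEY")
doi <- "10.1016/j.example.2014.01.001"  # placeholder DOI, not a real article

res <- GET(
  paste0("https://api.elsevier.com/content/article/doi/", doi),
  add_headers(`X-ELS-APIKey` = key, Accept = "text/xml")
)
stop_for_status(res)
xml <- content(res, as = "text", encoding = "UTF-8")
```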
@sinandrei we have included Scopus now, see |
@sckott indeed, cool! Am I right in thinking that the chunking for the Elsevier material is not yet implemented?
what do you mean by chunking?
The x %>% chunks("publisher") %>% tabularize() type syntax did not return any values for my Elsevier-type ft_get()s.
Can you open a new issue with a reproducible example? This is getting off topic for this issue.
What sources should we include in fulltext? Currently we have rplos, bmc, and elife for full text. We should also be able to get full text from F1000 via RSS, and I'm not sure about PeerJ.
We can also get metadata for most PubMed papers via rentrez, and we can get metadata access to arXiv papers: http://arxiv.org/help/bulk_data
What sources are we missing?