fulltext sources #4

Closed
emhart opened this issue Aug 8, 2014 · 27 comments

@emhart commented Aug 8, 2014

What sources should we include in fulltext? Currently we have:

  • rplos
  • bmc
  • elife

for full text. We should also be able to get full text from F1000 via RSS; I'm not sure about PeerJ.

We can also get metadata for most PubMed papers via rentrez, and we can get metadata for arXiv papers: http://arxiv.org/help/bulk_data
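
As a reference point, here's a minimal sketch of pulling PubMed metadata with rentrez (the search term is just an example):

library(rentrez)

# find a few PubMed records and pull their metadata
res  <- entrez_search(db = "pubmed", term = "ecology[MeSH]", retmax = 5)
meta <- entrez_summary(db = "pubmed", id = res$ids)
extract_from_esummary(meta, "title")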

What sources are we missing?

@emhart added the Planning label Aug 8, 2014

@sckott commented Aug 8, 2014

  • PubMed Central full text via rentrez, maintained by @dwinter

@sckott commented Aug 11, 2014

There's also

Both have OAI-PMH interfaces that I started to wrap. They may or may not give full text; I think they do. At minimum, we can get links to the full text from the OAI-PMH service.

These aren't worth making into separate packages; I'll fold them in here somehow, along with the other OAI-PMH data sources.
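
For what it's worth, an OAI-PMH request is just an HTTP GET with a couple of query parameters, so a minimal sketch with httr would look like this (the endpoint URL is a placeholder, to be swapped for either service's real one):

library(httr)
library(XML)

# generic OAI-PMH ListRecords request; swap in a real endpoint
base <- "http://example.org/oai"   # placeholder endpoint
res  <- GET(base, query = list(verb = "ListRecords", metadataPrefix = "oai_dc"))
doc  <- xmlParse(content(res, as = "text", encoding = "UTF-8"))
# full-text links, where present, often sit in <dc:identifier> elements
ids <- xpathSApply(doc, "//*[local-name()='identifier']", xmlValue)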

@sckott commented Aug 12, 2014

PeerJ: We should be able to get full text of PeerJ papers, but as far as I know we won't have a way to search them programmatically. Though maybe we could search via CrossRef metadata search, e.g.

F1000Research: We should be able to get them through PubMed, I think, e.g. http://www.ncbi.nlm.nih.gov/pubmed/25110583. All F1000 papers are OA, right?

@dwinter commented Aug 12, 2014

In terms of working out whether an article or journal is available in PMC: NCBI has a table of journals and their level of participation. Here's the search for F1000Research (which is indeed all open access):

http://www.ncbi.nlm.nih.gov/pmc/journals/?term=F1000Research&titles=all&search=journals

I can't tell if they allow programmatic access to the search, but you can download a flat table (.csv) with ~2000 rows. It might be helpful in working out how to retrieve a full-text copy of an article.
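
If the CSV is downloaded by hand, filtering it from R is straightforward. A sketch, where the filename and column name are guesses at how the download actually looks:

# "jlist.csv" and "Journal.title" are hypothetical names
jlist <- read.csv("jlist.csv", stringsAsFactors = FALSE)
subset(jlist, grepl("F1000Research", Journal.title, ignore.case = TRUE))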

@sckott commented Aug 12, 2014

Nice, thanks for that info! Wasn't aware of that table.

And PeerJ is in there too.

We can get DOIs by drilling down through this table, then, at least for PeerJ, use the article number in the DOI (e.g., 228 in 10.7717/peerj.228) to construct the URL to the full-text XML (e.g., https://peerj.com/articles/228.xml)
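
Something like this, using the DOI-to-URL pattern noted above:

doi <- "10.7717/peerj.228"
num <- sub("^10\\.7717/peerj\\.", "", doi)                    # "228"
xml_url <- paste0("https://peerj.com/articles/", num, ".xml")
# xml_url is now "https://peerj.com/articles/228.xml"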

@dwinter - is there a way in rentrez to query PubMed based on a particular journal?

@dwinter commented Aug 12, 2014

Yup, as ever you just have to look up how NCBI wants you to format the search term:

http://www.ncbi.nlm.nih.gov/books/NBK3825/#pmchelp.Searching_by_journal_title

pmc_search <- entrez_search(db="pmc", term="PeerJ[journal]")
pmc_search$count
    # [1] 496

You can also use the DOI to search on a per-article basis:

library(XML)

# find one article by DOI, then fetch and parse its full-text XML from PMC
pmc_search <- entrez_search(db = "pmc", term = "10.7717/peerj.228[doi]")
paper <- xmlTreeParse(entrez_fetch(db = "pmc", id = pmc_search$ids, rettype = "xml"),
                      useInternalNodes = TRUE)

@sckott commented Aug 12, 2014

Nice, thanks @dwinter !

@sckott commented Aug 16, 2014

The master list; I'll update this as we get more info:

@benmarwick commented Aug 16, 2014

Via its Data for Research service, JSTOR provides full text in the form of word counts (i.e., one CSV or XML file per article; each file has two columns, one of words and one of counts of those words).

You can only get the data after you've registered, submitted a request, and waited for the archive to be processed by the DFR servers and delivered as a zip file, which can take hours to days.

So it's rather awkward to deal with programmatically, as @sckott notes. That said, I'm happy to work on my JSTORr package to harmonize it with this one, as much as that's possible.

@karthik commented Aug 16, 2014

There is also arXiv.


@sckott commented Aug 16, 2014

@benmarwick Thanks for the update. Okay, let's keep your package in mind, see how progress goes on this one, and determine at a later date whether it's a good fit here.

@emhart commented Aug 26, 2014

We could also consider adding the global climate change information system. I've thought it would be good to have a package for it since I learned about it a couple of months ago. The system is super robust and gives access to summaries of hundreds of climate change reports. Full text could be accessed via links to PDFs, and I know tm can handle PDFs pretty easily. While the whole API is probably one of the most robust I've ever seen, we could start off with a stripped-down version that gives access to PDF full text and findings summaries. Maybe this doesn't fit because it's not traditional peer review.

@sckott commented Sep 7, 2014

@emhart Nice, I'll add it to the list.

@emhart commented Sep 20, 2014

@sckott I've been looking into how to do bioRxiv. An alternative to the strategy you laid out is to do it with scraping. The basic search interface is actually really easy to implement and scrape: http://biorxiv.org/search/climate. There's an advanced search that would be a bit more complicated to reverse engineer, but it doesn't look terrible. This may be true for other sources too. It raises two questions:

  1. Is having an API a requirement for inclusion in the package?
  2. Do we want each component to be a standalone package similar to spocc? It seems like we don't necessarily want this to be the case, but do we have an explicitly stated policy on this?

So it seems like integrating bioRxiv could definitely be done without too much trouble; a rough sketch of the scraping approach is below.
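
A sketch with rvest, not tested against the live site (the CSS selector is hypothetical and would need checking against the real markup):

library(rvest)

page  <- read_html("http://biorxiv.org/search/climate")
links <- page %>%
  html_elements(".highwire-cite-title a") %>%   # selector is a guess
  html_attr("href")
urls <- paste0("http://biorxiv.org", links)     # assuming hrefs are relative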

emhart added a commit that referenced this issue Sep 20, 2014

@emhart commented Sep 20, 2014

@sckott I created a new branch and added bioRxiv search functionality. This effectively scrapes the web pages and grabs the URLs for papers from the search. I'm not sure of the best way to pull in the full text, though. Currently we can grab abstracts easily, but the full text is in PDFs. We can parse the PDFs, but that requires some libraries on the user's end outside of R (http://poppler.freedesktop.org/). How do you handle this, @kbroman, in the aRxiv pkg? Can you get full text in formats other than PDF? Or do you have some PDF-parsing magic?

Also @sckott: do you think it's worth it to flesh out the search beyond the basic keyword? I'm not sure it will provide much advantage for us, since the current search covers all fields (full text, abstracts, and authors). See http://www.biorxiv.org/search for a full list of possible search fields.

@kbroman commented Sep 20, 2014

I haven't tried grabbing full text in the aRxiv package. I'm just taking what the API gives me. But the API does include the link to the PDF, so I don't have to scrape the abstract pages for that.

@emhart commented Sep 20, 2014

Thanks @kbroman. I know we can read PDFs in via tm, but it requires an underlying library to do so, and you might need to download them locally. It seems that if we go the route of downloading PDFs for fulltext, we should do it in a common directory across pkgs like aRxiv and other sources.
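
For reference, one option for pulling text out of a downloaded PDF from R is the pdftools package, which wraps poppler at the package level; a minimal sketch (the filename is a stand-in for a downloaded paper):

library(pdftools)

txt <- pdf_text("paper.pdf")      # one character string per page
cat(substr(txt[1], 1, 200))       # peek at the first page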

@sckott commented Sep 20, 2014

@emhart we can start with the basic search for bioRxiv, then modify if needed, I think.

I wonder if it would be worth allowing the user to optionally pass a PDF through another web service to extract the text, then pull that down. Some folks may want to extract their own text, but others may not want to, or may not know how. e.g.,
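
In that spirit, a purely hypothetical sketch: the endpoint URL and its API below are made up, stand-ins for whatever extraction service we might settle on:

library(httr)

# "https://extract.example.org/v1/pdf" is a made-up endpoint
res <- POST("https://extract.example.org/v1/pdf",
            body = list(file = upload_file("paper.pdf")),
            encode = "multipart")
txt <- content(res, as = "text", encoding = "UTF-8")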

@karthik commented Sep 20, 2014

> I wonder if it would be worth allowing the user to optionally pass a PDF through another web service to extract text, then pull that down.

I would suggest thinking through this as a modular approach. Downloading and passing PDFs through another service adds so much more overhead than querying text off of APIs. IMO it would be a good idea to abstract such heavy functionality out into a helper package that people can call as needed.

Another consideration is that these services won't be free, or if free will come with limited features, so they would require additional API keys (plus paid accounts) to work with.

@sckott commented Sep 20, 2014

yeah, modular is best

@sckott commented Dec 18, 2014

For PMC, we may want to import @cstubben's package https://github.com/cstubben/pmcXML; see a recent example on the mailing list: https://groups.google.com/d/msg/ropensci-discuss/T5aPR8e9RYc/GcK8-CwHMz4J

@andreifoldes commented Oct 15, 2017

Scopus provides a full-text API for ScienceDirect with API keys (they have an interface where you can create your own). Would you consider including that?

@sckott commented Oct 15, 2017

@sinandrei we have included Scopus now; see ft_abstract, ft_search, and ft_get.
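
A quick sketch of what that looks like, assuming a Scopus/Elsevier API key is configured as the package docs describe (result field names may differ):

library(fulltext)

res <- ft_search(query = "ecology", from = "scopus")
res$scopus   # per-source results slot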

@andreifoldes commented Oct 16, 2017

@sckott indeed, cool! Am I right in thinking that chunking for the Elsevier material is not yet implemented?

@sckott commented Oct 16, 2017

what do you mean by chunking?

@andreifoldes commented Oct 16, 2017

@sckott commented Oct 16, 2017

Can you open a new issue with a reproducible example? This is getting off-topic for this issue.

@sckott closed this Oct 12, 2018