Skip to content
This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

Fixing Wiley issues and adding file-types #73

Merged
merged 5 commits into from Dec 2, 2015

Conversation

willpearse
Copy link
Contributor

Where the suffix seems reasonable (*.4-letters) an attr of suffix is added to the return that is the file's type. This happens on download (...if you rename a file I can't figure out its type...), but given caching is the default it shouldn't be difficult for the user to just download, double-check the file type (...though, honestly, they should probably know this before downloading!) and then go from there. Address discussion on #3 and #69

The Wiley issue ( #71 ) is addressed in this, in that Ecology Letters articles download (which have been updated), and other Wiley journals that haven't been updated also download (with check and tests for that as well). It looks like Wiley haven't finished updating their website yet and (frankly) the new layout for Ecology Letters doesn't load on my browser, let alone in R. If they retain the backwards-compatibility page on the same link they have now, this code will continue to work (hopefully) whatever they do to the new version. I'll have to revisit this; I have at least written a number of Wiley tests now so we should know when other journals etc. break.

@sckott
Copy link
Contributor

sckott commented Nov 14, 2015

awesome, thanks

Testing locally, getting some failed tests, e.,g, :

<r> expect_true(file.exists(ft_get_si("10.1371/journal.pone.0127900", 1)))
#> Error in get_si_pub(x) : 
#>  Cannot find publisher for DOI: 10.1371/journal.pone.0127900 

<r> expect_true(file.exists(ft_get_si("10.1111/ele.12437", 1)))
#>  Error in get_si_pub(x) : Cannot find publisher for DOI: 10.1111/ele.12437

<r> expect_true(file.exists(ft_get_si("10.1126/science.1255768", "Appendix_BanksLeite_etal.txt")))
#> Error in get_si_pub(x) : 
#>  Cannot find publisher for DOI: 10.1126/science.1255768 

Seems like this stems from https://github.com/willpearse/fulltext/blob/master/R/ft_get_si_plugins.R#L10 - the object pub is a data.frame, and at least for me that logical test sometimes gives TRUE when it shouldn't -

@willpearse
Copy link
Contributor Author

Sorry for the delay getting back to you. This is disturbing to me, because these work locally for me. I do run all tests before PR-ing; clearly something is going wrong from my end. More soon; sorry about this.

@sckott
Copy link
Contributor

sckott commented Nov 15, 2015

thanks 👍

@willpearse
Copy link
Contributor Author

I've got rid of what could be causing it, but it's hard for me to say anything as (as I say) I can't trigger those errors. I do, however, sometimes get errors when using Wiley et al., and I'm concerned that's something server-side because sometimes it works, and sometimes it doesn't. Again, could just be me - frankly, I can't get the Ecology Letters website to load sometime on my machine!

@sckott
Copy link
Contributor

sckott commented Nov 15, 2015

Thanks. I'll dig in to it a little bit. Just to be sure, can you paste in your R session info, just in case this is due to diff. versions of dependencies.

@willpearse
Copy link
Contributor Author

No worries. Again, I'm sorry about this:

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fulltext_0.1.4.9000

loaded via a namespace (and not attached):
 [1] NLP_0.1-8         Rcpp_0.12.0       plyr_1.8.2        R.methodsS3_1.7.0
 [5] R.utils_2.1.0     tools_3.2.2       digest_0.6.8      jsonlite_0.9.16  
 [9] lubridate_1.3.3   memoise_0.2.1     gtable_0.1.2      R.cache_0.10.0   
[13] rcrossref_0.3.4   aRxiv_0.5.10      bibtex_0.4.0      DBI_0.3.1        
[17] parallel_3.2.2    proto_0.3-10      xml2_0.1.1        dplyr_0.4.2      
[21] httr_1.0.0        stringr_1.0.0     rredis_1.7.0      grid_3.2.2       
[25] R6_2.0.1          rentrez_0.4.2     XML_3.98-1.1      ggplot2_1.0.1    
[29] reshape2_1.4.1    solr_0.1.4        magrittr_1.5      whisker_0.3-2    
[33] rplos_0.5.2       scales_0.2.4      MASS_7.3-43       assertthat_0.1   
[37] colorspace_1.2-6  stringi_1.0-1     munsell_0.4.2     slam_0.1-32      
[41] tm_0.6-2          rjson_0.2.15      R.oo_1.19.0

@sckott
Copy link
Contributor

sckott commented Nov 17, 2015

conferencing for the next 2 days, i'll get to it as soon as I can

@willpearse
Copy link
Contributor Author

No worries; I completely understand and I'm grateful you're looking at it. Maybe something will occur to me...

@sckott
Copy link
Contributor

sckott commented Nov 26, 2015

hey @willpearse - seems like they give the entire URL in the html. are there problems with just pulling out the entire url, rather than part of it, then constructing by hand? seems like pulling out entire URL will help avoid problems where articles differ in their base urls. Perhaps this

html <- xml2::read_html(html)
urls <- xml2::xml_attr(xml2::xml_find_all(html, '//a[contains(@href,"supinfo")]'), "href")
url <- urls[si]

could replace the lines https://github.com/willpearse/fulltext/blob/master/R/ft_get_si_plugins.R#L79-L95 or some of them at least - I'm not sure if there's some checks that need to stay there

@willpearse
Copy link
Contributor Author

Potentially, but that's not going to fix the tests you flag that fail locally above, right?

@sckott
Copy link
Contributor

sckott commented Nov 27, 2015

tests pass with that change. of course there could be unwanted behavior that the tests don;t catch

but it seems like at least the change is capturing URLs correctly

@willpearse
Copy link
Contributor Author

...then I think this is something very weird because that code is in the Wiley checker, and those tests use the PLoS and Science plugins!

Either way, I'll check and then commit the changes later today...

@willpearse
Copy link
Contributor Author

Checked and good to go, as far as I can see. As I say, none of these changes affect the other problems you were having that I can see, but if they're working your end and they're working my end then I'm happy :D

@sckott
Copy link
Contributor

sckott commented Nov 27, 2015

Okay, the changes I suggested were to make it easier to pull out the URL, and not related to the problem brought up in #73 (comment)

@willpearse
Copy link
Contributor Author

Ah, OK, I see. Either way, I'm grateful of the suggestions :D

sckott added a commit that referenced this pull request Dec 2, 2015
Fixing Wiley issues and adding file-types
@sckott sckott merged commit 1e51f3e into ropensci-archive:master Dec 2, 2015
@sckott sckott added this to the v0.2 milestone Dec 2, 2015
@sckott sckott mentioned this pull request Jan 20, 2016
@sckott sckott modified the milestones: v0.2, v0.1.6 Jan 20, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants