This repository has been archived by the owner on Mar 27, 2023. It is now read-only.

identify open content license urls used for open access articles in hybrid journals #81

Open
njahn82 opened this issue Mar 6, 2020 · 3 comments
Labels
ETL extract, transform, load

Comments

@njahn82
Collaborator

njahn82 commented Mar 6, 2020

Here's a reproducible example (reprex) to obtain licenses used for all hybrid journals covered by the Open APC initiative.

# required libraries
library(dplyr) # data transformation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr) # working with list-columns
library(jsonlite) # working with json files
# load data, most recent dump, which also includes data from Jan and Feb 2020
license_df <- jsonlite::stream_in(url("https://raw.githubusercontent.com/subugoe/hybrid_oa_dashboard/update_jan_feb_20/data/jn_facets_df.json"), verbose = FALSE)
# prepare a summary table in which all license URL variants are broken down by publisher
license_df %>%
  select(license_refs, journal_title, publisher) %>%
  unnest(license_refs) %>%
  # different cases per publisher
  group_by(.id, publisher) %>%
  summarise(n_cases = sum(V1)) 
#> # A tibble: 567 x 3
#> # Groups:   .id [216]
#>    .id                                    publisher                 n_cases
#>    <chr>                                  <chr>                       <int>
#>  1 http:// creativecommons.org/licenses/… Cambridge University Pre…       3
#>  2 http://academic.oup.com/journals/page… Elsevier BV                     1
#>  3 http://academic.oup.com/journals/page… Oxford University Press …    4734
#>  4 http://academic.oup.com/journals/page… Oxford University Press …      23
#>  5 http://aspb.org/publications/aspb-jou… American Society of Plan…    4039
#>  6 http://avs.scitation.org/jvb/authors/… American Vacuum Society        35
#>  7 http://creative commons.org/licenses/… Cambridge University Pre…       1
#>  8 http://creative commons.org/licenses/… Cambridge University Pre…       1
#>  9 http://creative commons.org/licenses/… Cambridge University Pre…       1
#> 10 http://creative%20commons.org/license… Cambridge University Pre…       2
#> # … with 557 more rows

Created on 2020-03-06 by the reprex package (v0.3.0)
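
The output above shows several malformed variants of what look like Creative Commons URLs (a stray space after the scheme, a space or `%20` inside the host name). As a minimal sketch of how such variants could be collapsed before counting, the following base R helper removes whitespace, lower-cases the URL and unifies the scheme. `normalise_license_url()` is a hypothetical name and is not the repository's `license_normalise.R`; the example URLs are made up for illustration, because the real ones are truncated in the printed tibble.

```r
# Hypothetical helper, not the repository's license_normalise.R
normalise_license_url <- function(x) {
  x <- gsub("%20", "", x, fixed = TRUE)   # drop percent-encoded spaces
  x <- gsub("[[:space:]]+", "", x)        # drop literal whitespace
  x <- tolower(x)                         # compare case-insensitively
  x <- sub("^https?://", "https://", x)   # assumption: http and https are equivalent here
  sub("/+$", "", x)                       # drop trailing slashes
}

# Illustrative inputs only; they are not copied from the data set
examples <- c(
  "http:// creativecommons.org/licenses/by/4.0/",
  "http://creative commons.org/licenses/by/4.0",
  "http://creative%20commons.org/licenses/by/4.0/",
  "https://creativecommons.org/licenses/by/4.0/"
)

unique(normalise_license_url(examples))
#> [1] "https://creativecommons.org/licenses/by/4.0"
```

Applied to the `.id` column before `group_by()`, a step like this would merge such variants into one row per publisher. Whether treating `http` and `https` as the same license is appropriate is a judgment call for the project.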

@njahn82
Collaborator Author

njahn82 commented Mar 6, 2020

Some background:

Existing White List
https://github.com/subugoe/hybrid_oa_dashboard/blob/8e1e50d9403ec90a94c699e51919a46aeb1c0418/R/cr_fetching.R#L192-L203

Existing script to harmonize license URLs
https://github.com/subugoe/hybrid_oa_dashboard/blob/8e1e50d9403ec90a94c699e51919a46aeb1c0418/R/license_normalise.R#L4-L20

Related approach with comprehensive White List: “Applying Crossref and Unpaywall information to identify gold, hidden gold, hybrid and delayed Open Access publications in the KB publication corpus”: https://osf.io/preprints/socarxiv/sdzft/
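
To illustrate the white-list idea in general terms, here is a rough sketch that flags license URLs matching a small set of regular expressions. The patterns and the function name `is_open_license()` are assumptions made for this example; they do not reproduce the list in `R/cr_fetching.R` or the list from the linked preprint.

```r
# Hypothetical white list; NOT the list used in R/cr_fetching.R
open_license_patterns <- c(
  "creativecommons\\.org/licenses/",
  "creativecommons\\.org/publicdomain/"
)

# TRUE where a URL matches at least one white-list pattern
is_open_license <- function(url, patterns = open_license_patterns) {
  hits <- lapply(patterns, function(p) grepl(p, url, ignore.case = TRUE))
  Reduce(`|`, hits)
}

# Illustrative URLs only
urls <- c(
  "https://creativecommons.org/licenses/by/4.0/",
  "https://example.org/terms-of-use"
)
is_open_license(urls)
```

A pattern-based list tolerates minor URL variants such as differing schemes or version numbers, but it only works reliably if malformed URLs are cleaned up first.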

@njahn82
Collaborator Author

njahn82 commented Mar 19, 2020

Filtering for journals where no license URLs were shared, using the keep_empty = TRUE argument of tidyr::unnest():

# required libraries
library(dplyr) # data transformation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr) # working with list-columns
library(jsonlite) # working with json files
# load data, most recent dump, which also includes data from Jan and Feb 2020
license_df <- jsonlite::stream_in(url("https://raw.githubusercontent.com/subugoe/hybrid_oa_dashboard/update_jan_feb_20/data/jn_facets_df.json"), verbose = FALSE)
# list journals for which no license URL was recorded
license_df %>%
  select(license_refs, journal_title, publisher) %>%
  unnest(license_refs, keep_empty = TRUE) %>%
  filter(is.na(.id))
#> # A tibble: 470 x 4
#>    .id      V1 journal_title                     publisher                 
#>    <chr> <int> <chr>                             <chr>                     
#>  1 <NA>     NA Natures Sciences Sociétés         EDP Sciences              
#>  2 <NA>     NA Journal of Neuroscience           Society for Neuroscience  
#>  3 <NA>     NA Genes & Development               Cold Spring Harbor Labora…
#>  4 <NA>     NA Physiological Genomics            American Physiological So…
#>  5 <NA>     NA Climate Research                  Inter-Research Science Ce…
#>  6 <NA>     NA Jahrbuch der Österreichischen By… Osterreichische Akademie …
#>  7 <NA>     NA Zeitschrift für Antikes Christen… Walter de Gruyter GmbH    
#>  8 <NA>     NA Journal of Lipid Research         American Society for Bioc…
#>  9 <NA>     NA Molecular Biology of the Cell     American Society for Cell…
#> 10 <NA>     NA Journal of Biological Chemistry   American Society for Bioc…
#> # … with 460 more rows

Created on 2020-03-19 by the reprex package (v0.3.0)
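
For readers unfamiliar with `keep_empty`, the following toy example sketches what it does. By default `unnest()` drops rows whose list-column element is `NULL` or has zero rows; with `keep_empty = TRUE` each such row is kept and filled with missing values, which is what makes the `is.na()` filter possible. The data and the column names `license_url` and `n` are made up for illustration; the real data uses `.id` and `V1`. This assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy data: one journal with a license, two without (NULL and a zero-row table)
toy_df <- tibble(
  journal_title = c("Journal A", "Journal B", "Journal C"),
  license_refs = list(
    tibble(license_url = "https://creativecommons.org/licenses/by/4.0/", n = 10L),
    NULL,
    tibble(license_url = character(0), n = integer(0))
  )
)

# Default behaviour: journals without license information disappear
with_license <- toy_df %>%
  unnest(license_refs)

# keep_empty = TRUE keeps them as rows of NA, so they can be filtered for
no_license <- toy_df %>%
  unnest(license_refs, keep_empty = TRUE) %>%
  filter(is.na(license_url))

no_license$journal_title
```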

maxheld83 added the ETL extract, transform, load label Jun 3, 2020
@maxheld83
Contributor

maxheld83 commented Jun 3, 2020

I haven't touched any of this substantively, but I've created some scaffolding in 8e3cc1e.
I think the functions mentioned above to parse/corral/summarise the licensing info should be documented in the same place.

license_patterns is currently a static data frame, but once the above is implemented it should probably be re-evaluated on every build (though not at run time).
It might then have to be a function as per #47.
