identify open content license urls used for open access articles in hybrid journals #81

njahn82 · 2020-03-06T10:21:32Z

Here's a reproducible example (reprex) to obtain licenses used for all hybrid journals covered by the Open APC initiative.

# required libraries
library(dplyr) # data transformation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr) # working with list-columns
library(jsonlite) # working with json files
# load data, most recent dump, which also includes data from Jan and Feb 2020
license_df <- jsonlite::stream_in(url("https://raw.githubusercontent.com/subugoe/hybrid_oa_dashboard/update_jan_feb_20/data/jn_facets_df.json"), verbose = FALSE)
# prepare a summary table in which all license URL variants are broken down by publisher
license_df %>%
  select(license_refs, journal_title, publisher) %>%
  unnest(license_refs) %>%
  # distinct cases per publisher
  group_by(.id, publisher) %>%
  summarise(n_cases = sum(V1)) 
#> # A tibble: 567 x 3
#> # Groups:   .id [216]
#>    .id                                    publisher                 n_cases
#>    <chr>                                  <chr>                       <int>
#>  1 http:// creativecommons.org/licenses/… Cambridge University Pre…       3
#>  2 http://academic.oup.com/journals/page… Elsevier BV                     1
#>  3 http://academic.oup.com/journals/page… Oxford University Press …    4734
#>  4 http://academic.oup.com/journals/page… Oxford University Press …      23
#>  5 http://aspb.org/publications/aspb-jou… American Society of Plan…    4039
#>  6 http://avs.scitation.org/jvb/authors/… American Vacuum Society        35
#>  7 http://creative commons.org/licenses/… Cambridge University Pre…       1
#>  8 http://creative commons.org/licenses/… Cambridge University Pre…       1
#>  9 http://creative commons.org/licenses/… Cambridge University Pre…       1
#> 10 http://creative%20commons.org/license… Cambridge University Pre…       2
#> # … with 557 more rows

Created on 2020-03-06 by the reprex package (v0.3.0)

njahn82 · 2020-03-06T10:27:44Z

Some background:

Existing white list:
https://github.com/subugoe/hybrid_oa_dashboard/blob/8e1e50d9403ec90a94c699e51919a46aeb1c0418/R/cr_fetching.R#L192-L203

Existing script to harmonize license URLs:
https://github.com/subugoe/hybrid_oa_dashboard/blob/8e1e50d9403ec90a94c699e51919a46aeb1c0418/R/license_normalise.R#L4-L20

Related approach with comprehensive White List: “Applying Crossref and Unpaywall information to identify gold, hidden gold, hybrid and delayed Open Access publications in the KB publication corpus”: https://osf.io/preprints/socarxiv/sdzft/
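The summary table in the first comment shows several spellings of what appears to be the same Creative Commons host (a space after the scheme, a space inside the host name, a %20 escape). A minimal sketch of how such variants could be collapsed before matching them against a white list is given below. It is not the code in license_normalise.R; the helper name normalise_license_url and the example paths (by/4.0) are assumptions for illustration only, because the URLs in the printed output are truncated.

```r
# Hypothetical helper (not taken from the repository): collapse spelling
# variants of a license URL into one canonical form.
normalise_license_url <- function(x) {
  x <- gsub("%20", "", x, fixed = TRUE)           # drop percent-encoded spaces
  x <- gsub("[[:space:]]+", "", x)                # drop literal whitespace
  x <- tolower(x)                                 # compare case-insensitively
  x <- sub("^https?://(www\\.)?", "https://", x)  # one scheme, no leading www.
  sub("/+$", "", x)                               # drop trailing slashes
}

# Example variants; the paths are assumed, the host spellings mirror the output above
variants <- c(
  "http:// creativecommons.org/licenses/by/4.0/",
  "http://creative commons.org/licenses/by/4.0",
  "http://creative%20commons.org/licenses/by/4.0/",
  "https://creativecommons.org/licenses/by/4.0/"
)

normalised <- normalise_license_url(variants)

# Flag URLs that point to a Creative Commons license after normalisation
is_cc <- grepl("^https://creativecommons\\.org/licenses/", normalised)
```

Removing all whitespace is a deliberate simplification: it assumes that spaces in these URLs are always data-entry errors, which would need to be checked against the full list of variants.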

njahn82 · 2020-03-19T14:12:54Z

Filtering for journals where no license URLs were shared, using the keep_empty = TRUE argument of tidyr::unnest():

# required libraries
library(dplyr) # data transformation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr) # working with list-columns
library(jsonlite) # working with json files
# load data, most recent dump, which also includes data from Jan and Feb 2020
license_df <- jsonlite::stream_in(url("https://raw.githubusercontent.com/subugoe/hybrid_oa_dashboard/update_jan_feb_20/data/jn_facets_df.json"), verbose = FALSE)
# list journals for which no license URLs were shared
license_df %>%
  select(license_refs, journal_title, publisher) %>%
  unnest(license_refs, keep_empty = TRUE) %>%
  filter(is.na(.id))
#> # A tibble: 470 x 4
#>    .id      V1 journal_title                     publisher                 
#>    <chr> <int> <chr>                             <chr>                     
#>  1 <NA>     NA Natures Sciences Sociétés         EDP Sciences              
#>  2 <NA>     NA Journal of Neuroscience           Society for Neuroscience  
#>  3 <NA>     NA Genes & Development               Cold Spring Harbor Labora…
#>  4 <NA>     NA Physiological Genomics            American Physiological So…
#>  5 <NA>     NA Climate Research                  Inter-Research Science Ce…
#>  6 <NA>     NA Jahrbuch der Österreichischen By… Osterreichische Akademie …
#>  7 <NA>     NA Zeitschrift für Antikes Christen… Walter de Gruyter GmbH    
#>  8 <NA>     NA Journal of Lipid Research         American Society for Bioc…
#>  9 <NA>     NA Molecular Biology of the Cell     American Society for Cell…
#> 10 <NA>     NA Journal of Biological Chemistry   American Society for Bioc…
#> # … with 460 more rows

Created on 2020-03-19 by the reprex package (v0.3.0)
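To make the effect of keep_empty easier to see, here is a small self-contained sketch on toy data. The two journals and the license URL are made up; only the column names (.id, V1, journal_title, license_refs) follow the reprex above.

```r
library(tibble)
library(tidyr)

# Toy data (invented): Journal B has no license information at all
toy <- tibble(
  journal_title = c("Journal A", "Journal B"),
  license_refs = list(
    tibble(.id = "http://creativecommons.org/licenses/by/4.0/", V1 = 3L),
    NULL
  )
)

# Default behaviour: rows with an empty list element are dropped
dropped <- unnest(toy, license_refs)

# keep_empty = TRUE keeps such rows and fills the unnested columns with NA
kept <- unnest(toy, license_refs, keep_empty = TRUE)

# Journals without any license URL
no_license <- kept[is.na(kept$.id), ]
```

With the default, Journal B would silently disappear from the result, which is why the filter on is.na(.id) only works in combination with keep_empty = TRUE.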

maxheld83 · 2020-06-03T20:28:01Z

I haven't touched any of this substantively, but I've created some scaffolding in 8e3cc1e.
I think the functions mentioned above to parse/corral/summarise the licensing info should be documented in the same place.

license_patterns is currently a static data frame, but once the above is implemented it should probably be re-evaluated at every build (though not at run time).
It might then have to be a function, as per #47.
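A rough sketch of the difference between the two variants, under stated assumptions: the column names (pattern, license), the example rows, and the function name build_license_patterns are hypothetical and are not taken from 8e3cc1e or #47. The commit describes license_patterns as a tribble; base data.frame is used here only to keep the example free of dependencies.

```r
# Static variant: a fixed table that only changes when someone edits it.
# Rows and column names are assumptions for illustration.
license_patterns_static <- data.frame(
  pattern = c(
    "creativecommons\\.org/licenses/by/",
    "creativecommons\\.org/licenses/by-nc/"
  ),
  license = c("CC BY", "CC BY-NC"),
  stringsAsFactors = FALSE
)

# Function variant: recomputes the table from whatever URLs are passed in,
# so a build script could call it against the most recent data dump.
build_license_patterns <- function(urls, patterns = license_patterns_static) {
  # count how many observed URLs match each pattern
  hits <- vapply(
    patterns$pattern,
    function(p) sum(grepl(p, urls)),
    integer(1)
  )
  out <- patterns
  out$n_urls <- unname(hits)
  out
}

# Usage with invented URLs
observed <- c(
  "http://creativecommons.org/licenses/by/4.0/",
  "https://creativecommons.org/licenses/by-nc/4.0/",
  "https://example.org/terms"
)
license_patterns_built <- build_license_patterns(observed)
```

Calling such a function from a build step and storing the result would keep the table current without recomputing it at run time.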

njahn82 added this to the AP 1.2 Analyse und Typologisierung der Lizenzinformationen in Crossref milestone Mar 6, 2020

njahn82 assigned jhoeffler Mar 6, 2020

maxheld83 added the ETL extract, transform, load label Jun 3, 2020

maxheld83 added a commit that referenced this issue Jun 3, 2020

factor out and document license_patterns as tribble starting on #81

8e3cc1e

maxheld83 mentioned this issue Jun 3, 2020

factor out license patterns #213

Merged

njahn82 unassigned jhoeffler Jun 18, 2020

njahn82 added a commit that referenced this issue Jun 18, 2020

prepare manual license check #81

772ecf6

njahn82 added a commit that referenced this issue Jun 18, 2020

prepare spreadsheet for manual validation of license URLs indexed in …

aacce38

…Crossref #81

njahn82 mentioned this issue Jun 18, 2020

prepare spreadsheet for manual validation of license URLs indexed in … #236

Open
