Description
Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset/
Version submitted: 0.1.7
Submission type: Standard
Editor: @maelle
Reviewers: @msperlin, @romanflury
Due date for @romanflury: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en
- Paste the full DESCRIPTION file inside a code block below:
Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Date: 2022-08-19
Version: 0.1.7.3
Authors@R:
    person(given = "Daniel", family = "Antal",
           email = "daniel.antal@dataobservatory.eu",
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           )
Description: The aim of the 'dataset' package is to make tidy datasets easier to release,
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced,
    well-described, interoperable datasets into release and reuse ready form. A subjective
    interpretation of the W3C DataSet recommendation and the datacube model <https://www.w3.org/TR/vocab-data-cube/>,
    which is also used in the global Statistical Data and Metadata eXchange standards,
    the application of the connected Dublin Core <https://www.dublincore.org/specifications/dublin-core/dcmi-terms/>
    and DataCite <https://support.datacite.org/docs/datacite-metadata-schema-44/> standards
    preferred by European open science repositories to improve the findability, accessibility,
    interoperability and reusability of the datasets.
License: GPL (>= 3)
URL: https://github.com/dataobservatory-eu/dataset
BugReports: https://github.com/dataobservatory-eu/dataset/issues
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
Depends:
    R (>= 2.10)
LazyData: true
Imports:
    assertthat,
    ISOcodes,
    utils
Suggests:
    covr,
    declared,
    dplyr,
    eurostat,
    here,
    kableExtra,
    knitr,
    rdflib,
    readxl,
    rmarkdown,
    spelling,
    statcodelists,
    testthat (>= 3.0.0),
    tidyr
VignetteBuilder: knitr
Config/testthat/edition: 3
Language: en-US
You can find the package website on dataset.dataobservatory.eu. The article Motivation: Make Tidy Datasets Easier to Release Exchange and Reuse will eventually be condensed into a JOSS paper. It has a major development dilemma.
Scope
- Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- data retrieval
- data extraction
- data munging
- [x] data deposition
- data validation and testing
- workflow automation
- version control
- citation management and bibliometrics
- scientific software wrappers
- field and lab reproducibility tools
- database software bindings
- geospatial data
- text analysis
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Open science repositories and analysts' computers are full of datasets that have no provenance, structural or referential metadata. We believe that, whenever possible, metadata should be machine-recorded and should not be detached from the R object.
There are several R packages that have overlapping goals or functionality with dataset, but they use a different philosophy. When exporting to files, the metadata should be written at export time, but no sooner, and preferably into the file that contains the data.
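The idea of keeping metadata attached to the R object can be sketched with base R alone. This is a minimal illustration, not the API of the dataset package: the attribute names (`title`, `creator`, `source`) and the example values are assumptions made up for the sketch.

```r
# Base R only; the metadata field names (title, creator, source) are
# illustrative assumptions, not the schema used by the 'dataset' package.
df <- data.frame(
  geo = c("AT", "BE", "BG"),
  value = c(1.2, 3.4, 5.6)
)

# Attach descriptive metadata directly to the object as attributes.
attr(df, "title") <- "Example indicator"
attr(df, "creator") <- "Jane Doe"
attr(df, "source") <- "hypothetical example"

# The metadata travels with the object through a save/load round trip,
# so it is written only at export time and into the same file as the data.
tmp <- tempfile(fileext = ".rds")
saveRDS(df, tmp)
restored <- readRDS(tmp)

identical(attr(restored, "title"), attr(df, "title"))
```

Because the metadata lives in the attributes, no separate side-car file has to be kept in sync with the data while the object is being worked on.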
- Who is the target audience and what are scientific applications of this package?
This package is intended to give a common foundation to the rOpenGov reproducible research packages. It mainly serves communities that want to reuse statistical data (using the SDMX statistical (meta)data exchange sources, like Eurostat, IMF, World Bank, OECD...) or release new datasets from primary social sciences data that can be integrated into an SDMX compatible API or placed on a knowledge graph. Our main aim is to provide a clear publication workflow to the European open science repository Zenodo, and clear serialization strategies to RDF application.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
The dataspice package aims to create well-defined and referenced datasets, but follows a different schema and a different publication strategy. The dataset package follows the more restrictive W3C/SDMX "DataSet" definition within the datacube model, which is better suited to synchronizing with statistical data sources. Unlike dataset, dataspice relies on manual metadata entry from CSV files. (See the documentation of the dataspice package.)
The dataset package aims for a higher level of reproducibility and does not detach the metadata from the R object's attributes (it is intended to be used in other reproducible research packages that will directly record provenance and other transactional metadata into the attributes). We aim to bind together dataspice and dataset by creating export functions to CSV files that contain the same metadata that dataspice records. Generally, dataspice seems to be better suited to raw, observational data, while dataset is better suited to statistically processed data.
The intended use of dataset is to start correctly recording referential, structural and provenance metadata retrieved by various reproducible science packages that interact with statistical data (such as the rOpenGov packages eurostat and iotables, or the oecd package).
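As a rough sketch of what machine-recorded provenance could look like, the wrapper below stamps the result of any retrieval function with the time and the call that produced it. The helper `with_provenance()`, the stand-in `fake_download()` and the attribute name `provenance` are all hypothetical; none of them comes from dataset, eurostat, iotables or oecd.

```r
# Hypothetical helper: wraps any retrieval function and records when and how
# the data was obtained. It is not part of any of the packages named above.
with_provenance <- function(retrieve, ...) {
  # Capture the call as text before doing any work.
  call_text <- paste(deparse(match.call()), collapse = " ")
  result <- retrieve(...)
  attr(result, "provenance") <- list(
    retrieved_at = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    call = call_text,
    r_version = R.version.string
  )
  result
}

# Stand-in for a real download function, so the sketch runs offline.
fake_download <- function(code) {
  data.frame(code = code, geo = c("AT", "BE"), value = c(10, 20))
}

x <- with_provenance(fake_download, code = "demo_table")
str(attr(x, "provenance"))
```

Recording the call and timestamp inside the retrieval step means the analyst does not have to enter them by hand later.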
Neither dataset nor dataspice is well suited to documenting social sciences survey data, which are usually held in datasets. Our aim is to connect dataset, declared and DDIwR to create such datasets with DDI codebook metadata. They will create a stable new foundation for the retroharmonize package to create new, well-documented and harmonized statistical datasets from the observational datasets of social sciences surveys.
The zen4R package provides reproducible export functionality to the Zenodo open science repository. Interacting with zen4R may be intimidating for the casual R user as it uses R6 classes. Our aim is to provide an export function that completely wraps the workings of zen4R when releasing the dataset.
In our experience, the tidy data standards make reuse more efficient by eliminating unnecessary data processing steps before analysis or placement in a relational database, while applying the DataSet definition and the datacube model together with the information science metadata standards makes it more efficient to exchange the data and combine it with data from other datasets.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Yes
- If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
- Explain reasons for any pkgcheck items which your package is unable to pass.
Technical checks
Confirm each of the following by checking the box.
- [x] I have read the rOpenSci packaging guide.
- [x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.
This package:
- [x] does not violate the Terms of Service of any service it interacts with.
- [x] has a CRAN and OSI accepted license.
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions, created with roxygen2.
- [x] contains a vignette with examples of its essential functions and uses.
- [x] has a test suite.
- has continuous integration, including reporting of test coverage.
Publication options
- [x] Do you intend for this package to go on CRAN? -> Yes, I started the CRAN publication process, but opted to stop and get feedback from rOpenSci first.
- Do you intend for this package to go on Bioconductor? -> Don't know.
- Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- The package is novel and will be of interest to the broad readership of the journal.
- The manuscript describing the package is no longer than 3000 words.
- You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.