diff --git a/DESCRIPTION b/DESCRIPTION
index 2f932e37..5dbb201b 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -2,8 +2,8 @@ Package: sjmisc
 Type: Package
 Encoding: UTF-8
 Title: Data Transformation and Labelled Data Utility Functions
-Version: 2.3.1.9000
-Date: 2017-03-08
+Version: 2.4.0
+Date: 2017-04-06
 Author: Daniel Lüdecke
 Maintainer: Daniel Lüdecke
 Description: Collection of miscellaneous utility functions (especially intended
@@ -15,7 +15,8 @@ Description: Collection of miscellaneous utility functions (especially intended
     vectors into factors (and vice versa), or to deal with multiple declared
     missing values etc. 2) Data transformation tasks like recoding, dichotomizing
     or grouping variables, setting and replacing missing values.
-    The data transformation functions also support labelled data.
+    The data transformation functions also support labelled data, and all integrate
+    seamlessly into a 'tidyverse'-workflow.
 License: GPL-3
 Depends:
     R (>= 3.2),
diff --git a/NEWS b/NEWS
index 9de23f7b..5172a11c 100644
--- a/NEWS
+++ b/NEWS
@@ -1,4 +1,4 @@
-Version 2.3.1.9000
+Version 2.4.0
 ------------------------------------------------------------------------------
 General:
 * Argument `value` in `set_na()` is deprecated. Please use `na` instead.
@@ -13,13 +13,16 @@ New functions:
 * `%nin%` as complement to `%in%`.
 
 Changes to functions:
-* Functions `rec()`, `dicho()`, `center()`, `std()`, `recode_to()` and `group_var()` get a `append`-argument, to optionally return the original data including the transformed variables as new columns.
-* `var_labels()` and `var_rename()` now give a warning if specified variables to rename or relabel do not exist in the data frame. Non-matching variables are removed.
+* Functions `rec()`, `dicho()`, `center()`, `std()`, `recode_to()` and `group_var()` get an `append`-argument, to optionally return the original data including the transformed variables as new columns.
+* `var_labels()` and `var_rename()` now give a warning if specified variables to rename or relabel do not exist in the data frame. Non-matching variables are ignored.
+* If `model.term` does not exist in models, `spread_coef()` now prints the name of non-existing coefficients.
+* `find_var()` gets a `fuzzy`-argument to enable fuzzy-matching for search pattern.
 
 Bug fixes:
 * `remove_empty_cols()` returned an empty data frame, when input data frame had no empty columns.
 * `remove_empty_rows()` returned an empty data frame, when input data frame had no empty rows.
-* `add_columns()` and `repair_columns()` in some cases coerced data frames of class `data.frame` with only one column into a vector, which gave an error when binding columns.
+* `add_columns()` and `replace_columns()` in some cases coerced data frames of class `data.frame` with only one column into a vector, which gave an error when binding columns.
+* Argument `part.dist.match` in `str_pos()` caused an error when being larger than 0.
 
 Version 2.3.1
 ------------------------------------------------------------------------------
diff --git a/NEWS.md b/NEWS.md
index 7ba6fd65..3d934c3f 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,182 +1,185 @@
-# sjmisc 2.3.1.9000
-
-## General
-
-* Argument `value` in `set_na()` is deprecated. Please use `na` instead.
-* Argument `recodes` in `rec()` is deprecated. Please use `rec` instead.
-* Argument `lab` in `set_label()` is deprecated. Please use `label` instead.
-* Argument `value` in `add_labels()` and `replace_labels()` is deprecated. Please use `labels` instead.
-* Argument `value` in `ref_lvl()` is deprecated. Please use `lvl` instead.
-
-## New functions
-
-* `row_sums()` as wrapper of `rowSums()` to compute row sums, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble..
-* `row_means()` as wrapper of `sjstats::mean_n()` to compute row means, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble..
-* `%nin%` as complement to `%in%`.
-
-## Changes to functions
-
-* Functions `rec()`, `dicho()`, `center()`, `std()`, `recode_to()` and `group_var()` get a `append`-argument, to optionally return the original data including the transformed variables as new columns.
-* `var_labels()` and `var_rename()` now give a warning if specified variables to rename or relabel do not exist in the data frame. Non-matching variables are removed.
-
-## Bug fixes
-
-* `remove_empty_cols()` returned an empty data frame, when input data frame had no empty columns.
-* `remove_empty_rows()` returned an empty data frame, when input data frame had no empty rows.
-* `add_columns()` and `repair_columns()` in some cases coerced data frames of class `data.frame` with only one column into a vector, which gave an error when binding columns.
-
-# sjmisc 2.3.1
-
-## General
-
-* Re-exports `magrittr::%>%` (Bob Rudis said more packages should do so).
-
-## New functions
-
-* `replace_columns()` to replace variables in one data frame with variables from other data frames.
-
-## Changes to functions
-
-* `descr()` gets a `max.length`-argument to shorten variable labels in the output to a specific number of chars.
-* `descr()` now also reports the percentage of missing values.
-* `set_na()` no longer gives a warning when trying to replace values with `NA` for vectors that completely contained `NA`s.
-
-## Bug fixes
-
-* `descr()` now also works on single vectors as data argument.
-* Fixed bugs with `write_*()`-functions.
-
-# sjmisc 2.3.0
-
-## General
-
-* Added package-vignettes.
-* Functions were largely revised to work seamlessly within the tidyverse. This may break existing code, but in the long run, consistent api-design makes working with the package more intuitive. See `vignette("design_philosophy", "sjmisc")` for more details.
-* `as_labelled()` only converts vectors into `labelled`-class if vector has label attributes. This ensures that data can be properly saved into other formats, e.g. with `write_spss()`.
-* The `write_*()`-functions have been revised and should now save data frame with any common classes of vectors (labelled, factor, numeric, atomic...).
-
-## New functions
-
-* `center()` and `std()` are moving from package `sjstats` to `sjmisc`.
-* `add_columns()` to bind columns of first data frame at the end of all data frames.
-
-## Changes to functions
-
-* `frq()` now has the same argument-structure as `flat_table()`.
-* Following functions now follow a consistent tidyverse-approach, with the data being the first argument, followed by variable names: `add_labels()`, `replace_labels()`, `remove_labels()`, `count_na()`, `rec()`, `dicho()`, `split_var()`, `drop_labels()`, `fill_labels()`, `group_var()`, `group_labels()`, `ref_lvl()`, `recode_to()`, `replace_na()`, `set_na()` and `set_labels()`.
-* `get_values()` now sorts returned values by default, to be consistent with `get_labels()`.
-* `spread_coef()` gets arguments `se` and `p.val`, to define whether standard errors and p-values should be included in the return value or not.
-
-## Bug fixes
-
-* `merge_df()` did not copy label attributes for data frame with identical variables (that were row-bound).
-* `to_value()` did not work for character vectors of class labelled, with empty values (which typically have no value label).
-
-# sjmisc 2.2.1
-
-## Bug fixes
-
-* The `sort.frq` did not work `frq()`.
-
-# sjmisc 2.2.0
-
-## New functions
-
-* `zap_inf()` to "clean" vectors from `NaN` and infinite values.
-* `descr()` to provide basic descriptive statistics (similar to `describe()` in the psych-package), but including variable labels and usable in pipe-workflows. Also works with grouped data frames.
-
-## Changes to functions
-
-* `rec()`, `split_var()` and `dicho()` get an argument `suffix`, to append a suffix to variable (column) names, if applied on a data frame.
-* Value labels in `rec()` can now directly be assigned inside the `recodes`-syntax (see 'Details' in `?rec`).
-* `find_var()` gets a `as.df`-argument, to return a data frame with matching variables, instead of their column indices only.
-* `find_var()` gets a `as.varlab`-argument, to return a "summary" data frame with column number, variable name and variable label.
-* `flat_table()` now also accepts grouped data frames.
-* `flat_table()` gets a `show.values`-argument, to add values to associated labels in output.
-* `frq()` now also accepts grouped data frames.
-* `frq()` gets a `weight.by`-argument to weight frequencies.
-* `set_na()` can now also find values by their value labels and replace them with NA.
-* `set_na()` now removes unused value labels from values that have been replaced with NA.
-* The `as.tag`-argument in `set_na()` now defaults to `FALSE`.
-* `get_labels()` now always returns labels in sorted order of the associated values.
-* `get_labels()` gets a `drop.unused`-argument, to automatically drop labels from values that don't occur in the vector.
-* For a named vector as `labels`-argument, `set_labels()` now always sorts labels in sorted order of the associated values.
-* `is_empty()` gets a `first.only`-argument, to evaluate either first or all elements of a character vector.
-
-## Bug fixes
-
-* `set_na()` did not work on vectors of class `Date` when argument `as.tag = TRUE`.
-* `flat_table()` did not show values that had no value labels. Now all categories are shown in the frequency table.
-* `rec()` did not properly copy labels of tagged NA values when not all recoded values appeared in the vector.
-* `frq()` did not show correct values, when value labels of a vector were not sorted according their values.
-* `set_labels()` did not set labels properly for ordered factors.
-* `remove_labels()` returned NA-values for value labels (instead of no value labels) when the last value label of a vector was removed.
-
-
-# sjmisc 2.1.0
-
-## New functions
-
-* `find_var()` to find variables in data frames by name or label.
-* `var_labels()` as "tidyversed" alternative to `set_label()` to set variable labels.
-* `var_rename()` to rename variables.
-
-## Changes to functions
-
-* Following functions now get an ellipses-argument `...`, to apply function only to selected variables, but return the complete data frame (thus, overwriting existing variables in a data frame, if requested): `to_factor()`, `to_value()`, `to_label()`, `to_character()`, `to_dummy()`, `zap_labels()`, `zap_unlabelled()`, `zap_na_tags()`.
-
-## Bug fixes
-
-* Fixed bug with copying attributes from tibbles for `merge_df()`.
-* Fixed wrong argument-description in docs of `frq()`.
-
-# sjmisc 2.0.1
-
-## General
-
-* Removed package `coin` from Imports.
-
-## New functions
-
-* `count_na()` to print a frequency table of tagged NA values.
-
-## Changes to functions
-
-* `set_na()` gets a `drop.levels` argument to keep or drop factor levels of values that have been replaced with NA.
-* `set_na()` gets a `as.tag` argument to set NA values as regular or tagged NA.
-
-
-# sjmisc 2.0.0
-
-## General
-
-* **sjmisc** now supports _tagged_ `NA` values, a new structure for labelled missing values introduced by the [haven-package](https://cran.r-project.org/package=haven). This means that functions or arguments that are no longer useful, have been removed while other functions dealing with NA values have been largely revised.
-* All statistical functions have been removed and are now in a separate package, [sjstats](https://cran.r-project.org/package=sjstats).
-* Removed some S3-methods for `labelled`-class, as these are now provided by the haven-package.
-* Functions no longer check input for type `matrix`, to avoid conflicts with scaled vectors (that were recognized as matrix and hence treated as data frame).
-* `table(*, exclude = NULL)` was changed to `table(*, useNA = "always")`, because of planned changes in upcoming R version 3.4.
-* More functions (like `trim()` or `frq()`) now also have data frame- or list-methods.
-
-## New functions
-
-* `zap_na_tags()` to turn tagged NA values into regular NA values.
-* `spread_coef()` to spread coefficients of multiple fitted models in nested data frames into columns.
-* `merge_imputations()` to find the most likely imputed value for a missing value.
-* `flat_table()` to print flat (proportional) tables of labelled variables.
-* Added `to_character()` method.
-* `big_mark()` to format large numbers with big marks.
-* `empty_cols()` and `empty_rows()` to find variables or observations with exclusively NA values in a data frame.
-* `remove_empty_cols()` and `remove_empty_rows()` to remove variables or observations with exclusively NA values from a data frame.
-
-## Changes to functions
-* `str_contains()` gets a `switch` argument to switch the role of `x` and `pattern`.
-* `word_wrap()` coerces vectors to character if necessary.
-* `to_label()` gets a `var.label` and `drop.levels` argument, and now preserves variable labels by default.
-* Argument `def.value` in `get_label()` now also applies to data frame arguments.
-* If factor levels are numeric and factor has value labels, these are used in `to_value()` by default.
-* `to_factor()` no longer generates `NA` or `NaN`-levels when converting input into factors.
-
-## Bug fixes
-* `rec()` did not recode values, when these were the first element of a multi-line string of the `recodes` argument.
-* `is_empty()` returned `NA` instead of `TRUE` for empty character vectors.
-* Fixed bug with erroneous assignment of value labels to subset data when using `copy_labels()` ([#20](https://github.com/strengejacke/sjmisc/issues/20))
+# sjmisc 2.4.0
+
+## General
+
+* Argument `value` in `set_na()` is deprecated. Please use `na` instead.
+* Argument `recodes` in `rec()` is deprecated. Please use `rec` instead.
+* Argument `lab` in `set_label()` is deprecated. Please use `label` instead.
+* Argument `value` in `add_labels()` and `replace_labels()` is deprecated. Please use `labels` instead.
+* Argument `value` in `ref_lvl()` is deprecated. Please use `lvl` instead.
+
+## New functions
+
+* `row_sums()` as wrapper of `rowSums()` to compute row sums, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble..
+* `row_means()` as wrapper of `sjstats::mean_n()` to compute row means, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble..
+* `%nin%` as complement to `%in%`.
+
+## Changes to functions
+
+* Functions `rec()`, `dicho()`, `center()`, `std()`, `recode_to()` and `group_var()` get an `append`-argument, to optionally return the original data including the transformed variables as new columns.
+* `var_labels()` and `var_rename()` now give a warning if specified variables to rename or relabel do not exist in the data frame. Non-matching variables are ignored.
+* If `model.term` does not exist in models, `spread_coef()` now prints the name of non-existing coefficients.
+* `find_var()` gets a `fuzzy`-argument to enable fuzzy-matching for search pattern.
+
+## Bug fixes
+
+* `remove_empty_cols()` returned an empty data frame, when input data frame had no empty columns.
+* `remove_empty_rows()` returned an empty data frame, when input data frame had no empty rows.
+* `add_columns()` and `replace_columns()` in some cases coerced data frames of class `data.frame` with only one column into a vector, which gave an error when binding columns.
+* Argument `part.dist.match` in `str_pos()` caused an error when being larger than 0.
+
+# sjmisc 2.3.1
+
+## General
+
+* Re-exports `magrittr::%>%` (Bob Rudis said more packages should do so).
+
+## New functions
+
+* `replace_columns()` to replace variables in one data frame with variables from other data frames.
+
+## Changes to functions
+
+* `descr()` gets a `max.length`-argument to shorten variable labels in the output to a specific number of chars.
+* `descr()` now also reports the percentage of missing values.
+* `set_na()` no longer gives a warning when trying to replace values with `NA` for vectors that completely contained `NA`s.
+
+## Bug fixes
+
+* `descr()` now also works on single vectors as data argument.
+* Fixed bugs with `write_*()`-functions.
+
+# sjmisc 2.3.0
+
+## General
+
+* Added package-vignettes.
+* Functions were largely revised to work seamlessly within the tidyverse. This may break existing code, but in the long run, consistent api-design makes working with the package more intuitive. See `vignette("design_philosophy", "sjmisc")` for more details.
+* `as_labelled()` only converts vectors into `labelled`-class if vector has label attributes. This ensures that data can be properly saved into other formats, e.g. with `write_spss()`.
+* The `write_*()`-functions have been revised and should now save data frame with any common classes of vectors (labelled, factor, numeric, atomic...).
+
+## New functions
+
+* `center()` and `std()` are moving from package `sjstats` to `sjmisc`.
+* `add_columns()` to bind columns of first data frame at the end of all data frames.
+
+## Changes to functions
+
+* `frq()` now has the same argument-structure as `flat_table()`.
+* Following functions now follow a consistent tidyverse-approach, with the data being the first argument, followed by variable names: `add_labels()`, `replace_labels()`, `remove_labels()`, `count_na()`, `rec()`, `dicho()`, `split_var()`, `drop_labels()`, `fill_labels()`, `group_var()`, `group_labels()`, `ref_lvl()`, `recode_to()`, `replace_na()`, `set_na()` and `set_labels()`.
+* `get_values()` now sorts returned values by default, to be consistent with `get_labels()`.
+* `spread_coef()` gets arguments `se` and `p.val`, to define whether standard errors and p-values should be included in the return value or not.
+
+## Bug fixes
+
+* `merge_df()` did not copy label attributes for data frame with identical variables (that were row-bound).
+* `to_value()` did not work for character vectors of class labelled, with empty values (which typically have no value label).
+
+# sjmisc 2.2.1
+
+## Bug fixes
+
+* The `sort.frq` did not work `frq()`.
+
+# sjmisc 2.2.0
+
+## New functions
+
+* `zap_inf()` to "clean" vectors from `NaN` and infinite values.
+* `descr()` to provide basic descriptive statistics (similar to `describe()` in the psych-package), but including variable labels and usable in pipe-workflows. Also works with grouped data frames.
+
+## Changes to functions
+
+* `rec()`, `split_var()` and `dicho()` get an argument `suffix`, to append a suffix to variable (column) names, if applied on a data frame.
+* Value labels in `rec()` can now directly be assigned inside the `recodes`-syntax (see 'Details' in `?rec`).
+* `find_var()` gets a `as.df`-argument, to return a data frame with matching variables, instead of their column indices only.
+* `find_var()` gets a `as.varlab`-argument, to return a "summary" data frame with column number, variable name and variable label.
+* `flat_table()` now also accepts grouped data frames.
+* `flat_table()` gets a `show.values`-argument, to add values to associated labels in output.
+* `frq()` now also accepts grouped data frames.
+* `frq()` gets a `weight.by`-argument to weight frequencies.
+* `set_na()` can now also find values by their value labels and replace them with NA.
+* `set_na()` now removes unused value labels from values that have been replaced with NA.
+* The `as.tag`-argument in `set_na()` now defaults to `FALSE`.
+* `get_labels()` now always returns labels in sorted order of the associated values.
+* `get_labels()` gets a `drop.unused`-argument, to automatically drop labels from values that don't occur in the vector.
+* For a named vector as `labels`-argument, `set_labels()` now always sorts labels in sorted order of the associated values.
+* `is_empty()` gets a `first.only`-argument, to evaluate either first or all elements of a character vector.
+
+## Bug fixes
+
+* `set_na()` did not work on vectors of class `Date` when argument `as.tag = TRUE`.
+* `flat_table()` did not show values that had no value labels. Now all categories are shown in the frequency table.
+* `rec()` did not properly copy labels of tagged NA values when not all recoded values appeared in the vector.
+* `frq()` did not show correct values, when value labels of a vector were not sorted according their values.
+* `set_labels()` did not set labels properly for ordered factors.
+* `remove_labels()` returned NA-values for value labels (instead of no value labels) when the last value label of a vector was removed.
+
+
+# sjmisc 2.1.0
+
+## New functions
+
+* `find_var()` to find variables in data frames by name or label.
+* `var_labels()` as "tidyversed" alternative to `set_label()` to set variable labels.
+* `var_rename()` to rename variables.
+
+## Changes to functions
+
+* Following functions now get an ellipses-argument `...`, to apply function only to selected variables, but return the complete data frame (thus, overwriting existing variables in a data frame, if requested): `to_factor()`, `to_value()`, `to_label()`, `to_character()`, `to_dummy()`, `zap_labels()`, `zap_unlabelled()`, `zap_na_tags()`.
+
+## Bug fixes
+
+* Fixed bug with copying attributes from tibbles for `merge_df()`.
+* Fixed wrong argument-description in docs of `frq()`.
+
+# sjmisc 2.0.1
+
+## General
+
+* Removed package `coin` from Imports.
+
+## New functions
+
+* `count_na()` to print a frequency table of tagged NA values.
+
+## Changes to functions
+
+* `set_na()` gets a `drop.levels` argument to keep or drop factor levels of values that have been replaced with NA.
+* `set_na()` gets a `as.tag` argument to set NA values as regular or tagged NA.
+
+
+# sjmisc 2.0.0
+
+## General
+
+* **sjmisc** now supports _tagged_ `NA` values, a new structure for labelled missing values introduced by the [haven-package](https://cran.r-project.org/package=haven). This means that functions or arguments that are no longer useful, have been removed while other functions dealing with NA values have been largely revised.
+* All statistical functions have been removed and are now in a separate package, [sjstats](https://cran.r-project.org/package=sjstats).
+* Removed some S3-methods for `labelled`-class, as these are now provided by the haven-package.
+* Functions no longer check input for type `matrix`, to avoid conflicts with scaled vectors (that were recognized as matrix and hence treated as data frame).
+* `table(*, exclude = NULL)` was changed to `table(*, useNA = "always")`, because of planned changes in upcoming R version 3.4.
+* More functions (like `trim()` or `frq()`) now also have data frame- or list-methods.
+
+## New functions
+
+* `zap_na_tags()` to turn tagged NA values into regular NA values.
+* `spread_coef()` to spread coefficients of multiple fitted models in nested data frames into columns.
+* `merge_imputations()` to find the most likely imputed value for a missing value.
+* `flat_table()` to print flat (proportional) tables of labelled variables.
+* Added `to_character()` method.
+* `big_mark()` to format large numbers with big marks.
+* `empty_cols()` and `empty_rows()` to find variables or observations with exclusively NA values in a data frame.
+* `remove_empty_cols()` and `remove_empty_rows()` to remove variables or observations with exclusively NA values from a data frame.
+
+## Changes to functions
+* `str_contains()` gets a `switch` argument to switch the role of `x` and `pattern`.
+* `word_wrap()` coerces vectors to character if necessary.
+* `to_label()` gets a `var.label` and `drop.levels` argument, and now preserves variable labels by default.
+* Argument `def.value` in `get_label()` now also applies to data frame arguments.
+* If factor levels are numeric and factor has value labels, these are used in `to_value()` by default.
+* `to_factor()` no longer generates `NA` or `NaN`-levels when converting input into factors.
+
+## Bug fixes
+* `rec()` did not recode values, when these were the first element of a multi-line string of the `recodes` argument.
+* `is_empty()` returned `NA` instead of `TRUE` for empty character vectors.
+* Fixed bug with erroneous assignment of value labels to subset data when using `copy_labels()` ([#20](https://github.com/strengejacke/sjmisc/issues/20))
diff --git a/R/find_var.R b/R/find_var.R
index 14ebdba5..747b630a 100644
--- a/R/find_var.R
+++ b/R/find_var.R
@@ -1,157 +1,181 @@
-#' @title Find variable by name or label
-#' @name find_var
-#'
-#' @description This functions finds variables in a data frame, which variable
-#' names or variable (and value) label attribute match a specific
-#' pattern. Regular expression for the pattern is supported.
-#'
-#' @param data A data frame.
-#' @param pattern Character string to be matched in \code{data}. May also be a
-#' character vector of length > 1 (see 'Examples'). \code{pattern} is
-#' searched for in column names and variable label attributes of
-#' \code{data} (see \code{\link{get_label}}). \code{pattern} might also
-#' be a regular-expression object, as returned by \code{\link[stringr]{regex}},
-#' or any of \pkg{stringr}'s supported \code{\link[stringr]{modifiers}}.
-#' @param ignore.case Logical, whether matching should be case sensitive or not.
-#' @param search Character string, indicating where \code{pattern} is sought.
-#' Use one of following options:
-#' \describe{
-#' \item{\code{"name_label"}}{The default, searches for \code{pattern} in
-#' variable names and variable labels.}
-#' \item{\code{"name_value"}}{Searches for \code{pattern} in
-#' variable names and value labels.}
-#' \item{\code{"label_value"}}{Searches for \code{pattern} in
-#' variable and value labels.}
-#' \item{\code{"name"}}{Searches for \code{pattern} in
-#' variable names.}
-#' \item{\code{"label"}}{Searches for \code{pattern} in
-#' variable labels}
-#' \item{\code{"value"}}{Searches for \code{pattern} in
-#' value labels.}
-#' \item{\code{"all"}}{Searches for \code{pattern} in
-#' variable names, variable and value labels.}
-#' }
-#' @param as.df Logical, if \code{TRUE}, a data frame with matching variables
-#' is returned (instead of their column indices).
-#' @param as.varlab Logical, if \code{TRUE}, not only column indices, but also
-#' variables labels of matching variables are returned (as
-#' data frame).
-#'
-#' @return A named vector with column indices of found variables (variable names
-#' are used as names-attribute); if \code{as.df = TRUE}, a tibble
-#' with found variables; or, if \code{as.varlab = TRUE}, a tibble
-#' with three columns: column number, variable name and variable label.
-#'
-#' @details This function searches for \code{pattern} in \code{data}'s column names
-#' and - for labelled data - in all variable and value labels of \code{data}'s
-#' variables (see \code{\link{get_label}} for details on variable labels and
-#' labelled data). Search is performed using the
-#' \code{\link[stringr]{str_detect}} functions; hence, regular
-#' expressions are supported as well, by simply using
-#' \code{pattern = stringr::regex(...)}.
-#'
-#' @examples
-#' data(efc)
-#'
-#' # find variables with "cop" in variable name
-#' find_var(efc, "cop")
-#'
-#' # return tibble with matching variables
-#' find_var(efc, "cop", as.df = TRUE)
-#'
-#' # or return "summary"-tibble with matching variables
-#' # and their variable labels
-#' find_var(efc, "cop", as.varlab = TRUE)
-#'
-#' # find variables with "dependency" in names and variable labels
-#' find_var(efc, "dependency")
-#' get_label(efc$e42dep)
-#'
-#' # find variables with "level" in names and value labels
-#' res <- find_var(efc, "level", search = "name_value", as.df = TRUE)
-#' res
-#' get_labels(res, attr.only = FALSE)
-#'
-#' # use sjPlot::view_df() to view results
-#' \dontrun{
-#' library(sjPlot)
-#' view_df(res)}
-#'
-#' @importFrom stringr regex coll
-#' @importFrom tibble as_tibble
-#' @export
-find_var <- function(data,
-                     pattern,
-                     ignore.case = TRUE,
-                     search = c("name_label", "name_value", "label_value", "name", "label", "value", "all"),
-                     as.df = FALSE,
-                     as.varlab = FALSE) {
-  # check valid args
-  if (!is.data.frame(data)) {
-    stop("`data` must be a data frame.", call. = F)
-  }
-
-  # match args
-  search <- match.arg(search)
-
-  pos1 <- pos2 <- pos3 <- c()
-
-  # search for pattern in variable names
-  if (search %in% c("name", "name_label", "name_value", "all")) {
-    # check variable names
-    if (inherits(pattern, "regex"))
-      pos1 <- which(stringr::str_detect(colnames(data), pattern))
-    else
-      pos1 <- which(stringr::str_detect(colnames(data), stringr::coll(pattern, ignore_case = ignore.case)))
-  }
-
-
-  # search for pattern in variable labels
-  if (search %in% c("label", "name_label", "label_value", "all")) {
-    # get labels and variable names
-    labels <- get_label(data)
-
-    # check labels
-    if (inherits(pattern, "regex"))
-      pos2 <- which(stringr::str_detect(labels, pattern))
-    else
-      pos2 <- which(stringr::str_detect(labels, stringr::coll(pattern, ignore_case = ignore.case)))
-  }
-
-  # search for pattern in value labels
-  if (search %in% c("value", "name_value", "label_value", "all")) {
-    labels <- get_labels(data, attr.only = F)
-
-    # check value labels with regex
-    if (inherits(pattern, "regex")) {
-      pos3 <- which(unlist(lapply(labels, function(x) {
-        any(stringr::str_detect(x, pattern))
-      })))
-    } else {
-      pos3 <- which(unlist(lapply(labels, function(x) {
-        any(stringr::str_detect(x, stringr::coll(pattern, ignore_case = ignore.case)))
-      })))
-    }
-  }
-
-  # get unqiue variable indices
-  pos <- unique(c(pos1, pos2, pos3))
-
-  # return data frame?
-  if (as.df) {
-    return(tibble::as_tibble(data[, pos]))
-  }
-
-  # return variable labels?
-  if (as.varlab) {
-    return(tibble::tibble(
-      col.nr = pos,
-      var.name = colnames(data)[pos],
-      var.label = get_label(data[, pos], def.value = colnames(data)[pos])
-    ))
-  }
-
-  # use column names
-  names(pos) <- colnames(data)[pos]
-  pos
-}
+#' @title Find variable by name or label
+#' @name find_var
+#'
+#' @description This functions finds variables in a data frame, which variable
+#' names or variable (and value) label attribute match a specific
+#' pattern. Regular expression for the pattern is supported.
+#'
+#' @param data A data frame.
+#' @param pattern Character string to be matched in \code{data}. May also be a
+#' character vector of length > 1 (see 'Examples'). \code{pattern} is
+#' searched for in column names and variable label attributes of
+#' \code{data} (see \code{\link{get_label}}). \code{pattern} might also
+#' be a regular-expression object, as returned by \code{\link[stringr]{regex}},
+#' or any of \pkg{stringr}'s supported \code{\link[stringr]{modifiers}}.
+#' @param ignore.case Logical, whether matching should be case sensitive or not.
+#' @param search Character string, indicating where \code{pattern} is sought.
+#' Use one of following options:
+#' \describe{
+#' \item{\code{"name_label"}}{The default, searches for \code{pattern} in
+#' variable names and variable labels.}
+#' \item{\code{"name_value"}}{Searches for \code{pattern} in
+#' variable names and value labels.}
+#' \item{\code{"label_value"}}{Searches for \code{pattern} in
+#' variable and value labels.}
+#' \item{\code{"name"}}{Searches for \code{pattern} in
+#' variable names.}
+#' \item{\code{"label"}}{Searches for \code{pattern} in
+#' variable labels}
+#' \item{\code{"value"}}{Searches for \code{pattern} in
+#' value labels.}
+#' \item{\code{"all"}}{Searches for \code{pattern} in
+#' variable names, variable and value labels.}
+#' }
+#' @param as.df Logical, if \code{TRUE}, a data frame with matching variables
+#' is returned (instead of their column indices).
+#' @param as.varlab Logical, if \code{TRUE}, not only column indices, but also
+#' variables labels of matching variables are returned (as
+#' data frame).
+#' @param fuzzy Logical, if \code{TRUE}, "fuzzy matching" (partial and
+#' close distance matching) will be used to find \code{pattern}
+#' in \code{data} if no exact match was found. \code{\link{str_pos}}
+#' is used for fuzzy matching.
+#' +#' @return A named vector with column indices of found variables (variable names +#' are used as names-attribute); if \code{as.df = TRUE}, a tibble +#' with found variables; or, if \code{as.varlab = TRUE}, a tibble +#' with three columns: column number, variable name and variable label. +#' +#' @details This function searches for \code{pattern} in \code{data}'s column names +#' and - for labelled data - in all variable and value labels of \code{data}'s +#' variables (see \code{\link{get_label}} for details on variable labels and +#' labelled data). Search is performed using the +#' \code{\link[stringr]{str_detect}} functions; hence, regular +#' expressions are supported as well, by simply using +#' \code{pattern = stringr::regex(...)}. +#' +#' @examples +#' data(efc) +#' +#' # find variables with "cop" in variable name +#' find_var(efc, "cop") +#' +#' # return tibble with matching variables +#' find_var(efc, "cop", as.df = TRUE) +#' +#' # or return "summary"-tibble with matching variables +#' # and their variable labels +#' find_var(efc, "cop", as.varlab = TRUE) +#' +#' # find variables with "dependency" in names and variable labels +#' find_var(efc, "dependency") +#' get_label(efc$e42dep) +#' +#' # find variables with "level" in names and value labels +#' res <- find_var(efc, "level", search = "name_value", as.df = TRUE) +#' res +#' get_labels(res, attr.only = FALSE) +#' +#' # use sjPlot::view_df() to view results +#' \dontrun{ +#' library(sjPlot) +#' view_df(res)} +#' +#' @importFrom stringr regex coll +#' @importFrom tibble as_tibble +#' @export +find_var <- function(data, + pattern, + ignore.case = TRUE, + search = c("name_label", "name_value", "label_value", "name", "label", "value", "all"), + as.df = FALSE, + as.varlab = FALSE, + fuzzy = FALSE) { + # check valid args + if (!is.data.frame(data)) { + stop("`data` must be a data frame.", call. 
= FALSE) + } + + # match args + search <- match.arg(search) + + pos1 <- pos2 <- pos3 <- c() + + # search for pattern in variable names + if (search %in% c("name", "name_label", "name_value", "all")) { + # check variable names + if (inherits(pattern, "regex")) + pos1 <- which(stringr::str_detect(colnames(data), pattern)) + else + pos1 <- which(stringr::str_detect(colnames(data), stringr::coll(pattern, ignore_case = ignore.case))) + + # if nothing found, find in near distance + if (sjmisc::is_empty(pos1) && fuzzy && !inherits(pattern, "regex")) { + pos1 <- str_pos(search.string = colnames(data), find.term = pattern, part.dist.match = 1) + } + } + + + # search for pattern in variable labels + if (search %in% c("label", "name_label", "label_value", "all")) { + # get labels and variable names + labels <- get_label(data) + + # check labels + if (inherits(pattern, "regex")) + pos2 <- which(stringr::str_detect(labels, pattern)) + else + pos2 <- which(stringr::str_detect(labels, stringr::coll(pattern, ignore_case = ignore.case))) + + # if nothing found, find in near distance + if (sjmisc::is_empty(pos2) && fuzzy && !inherits(pattern, "regex")) { + pos2 <- str_pos(search.string = labels, find.term = pattern, part.dist.match = 1) + } + } + + # search for pattern in value labels + if (search %in% c("value", "name_value", "label_value", "all")) { + labels <- get_labels(data, attr.only = FALSE) + + # check value labels with regex + if (inherits(pattern, "regex")) { + pos3 <- which(unlist(lapply(labels, function(x) { + any(stringr::str_detect(x, pattern)) + }))) + } else { + pos3 <- which(unlist(lapply(labels, function(x) { + any(stringr::str_detect(x, stringr::coll(pattern, ignore_case = ignore.case))) + }))) + } + + # if nothing found, find in near distance + if (sjmisc::is_empty(pos3) && fuzzy && !inherits(pattern, "regex")) { + pos3 <- which(unlist(lapply(labels, function(x) { + any(str_pos(search.string = x, find.term = pattern, part.dist.match = 1) != -1) + }))) + } + } + + # get unique
variable indices + pos <- unique(c(pos1, pos2, pos3)) + # remove -1 + pos <- pos[which(pos != -1)] + + # return data frame? + if (as.df) { + return(tibble::as_tibble(data[, pos])) + } + + # return variable labels? + if (as.varlab) { + return(tibble::tibble( + col.nr = pos, + var.name = colnames(data)[pos], + var.label = get_label(data[, pos], def.value = colnames(data)[pos]) + )) + } + + # use column names + names(pos) <- colnames(data)[pos] + pos +} diff --git a/R/rec.R b/R/rec.R index 9e728209..f9dcaf72 100644 --- a/R/rec.R +++ b/R/rec.R @@ -1,432 +1,432 @@ -#' @title Recode variables -#' @name rec -#' -#' @description Recodes values of variables -#' -#' @seealso \code{\link{set_na}} for setting \code{NA} values, \code{\link{replace_na}} -#' to replace \code{NA}'s with specific value, \code{\link{recode_to}} -#' for re-shifting value ranges and \code{\link{ref_lvl}} to change the -#' reference level of (numeric) factors. -#' -#' @param rec String with recode pairs of old and new values. See -#' 'Details' for examples. \code{\link{rec_pattern}} is a convenient -#' function to create recode strings for grouping variables. -#' @param as.num Logical, if \code{TRUE}, return value will be numeric, not a factor. -#' @param var.label Optional string, to set variable label attribute for the -#' returned variable (see vignette \href{../doc/intro_sjmisc.html}{Labelled Data and the sjmisc-Package}). -#' If \code{NULL} (default), variable label attribute of \code{x} will -#' be used (if present). If empty, variable label attributes will be removed. -#' @param val.labels Optional character vector, to set value label attributes -#' of recoded variable (see vignette \href{../doc/intro_sjmisc.html}{Labelled Data and the sjmisc-Package}). -#' If \code{NULL} (default), no value labels will be set. Value labels -#' can also be directly defined in the \code{rec}-syntax, see -#' 'Details'. 
-#' @param append Logical, if \code{TRUE} and \code{x} is a data frame, -#' \code{x} including the new variables as additional columns is returned; -#' if \code{FALSE} (the default), only the new variables are returned. -#' @param suffix String value, will be appended to variable (column) names of -#' \code{x}, if \code{x} is a data frame. If \code{x} is not a data -#' frame, this argument will be ignored. The default value to suffix -#' column names in a data frame depends on the function call: -#' \itemize{ -#' \item recoded variables (\code{rec()}) will be suffixed with \code{"_r"} -#' \item recoded variables (\code{recode_to()}) will be suffixed with \code{"_r0"} -#' \item dichotomized variables (\code{dicho()}) will be suffixed with \code{"_d"} -#' \item grouped variables (\code{split_var()}) will be suffixed with \code{"_g"} -#' \item grouped variables (\code{group_var()}) will be suffixed with \code{"_gr"} -#' \item standardized variables (\code{std()}) will be suffixed with \code{"_z"} -#' \item centered variables (\code{center()}) will be suffixed with \code{"_c"} -#' } -#' @param recodes Deprecated. Use \code{rec} instead. -#' -#' @inheritParams to_factor -#' -#' @return \code{x} with recoded categories. If \code{x} is a data frame, -#' for \code{append = TRUE}, \code{x} including the recoded variables -#' as new columns is returned; if \code{append = FALSE}, only -#' the recoded variables will be returned. -#' -#' @details The \code{rec} string has following syntax: -#' \describe{ -#' \item{recode pairs}{each recode pair has to be separated by a \code{;}, e.g. \code{rec = "1=1; 2=4; 3=2; 4=3"}} -#' \item{multiple values}{multiple old values that should be recoded into a new single value may be separated with comma, e.g. \code{"1,2=1; 3,4=2"}} -#' \item{value range}{a value range is indicated by a colon, e.g. 
\code{"1:4=1; 5:8=2"} (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)} -#' \item{\code{"min"} and \code{"max"}}{minimum and maximum values are indicates by \emph{min} (or \emph{lo}) and \emph{max} (or \emph{hi}), e.g. \code{"min:4=1; 5:max=2"} (recodes all values from minimum values of \code{x} to 4 into 1, and from 5 to maximum values of \code{x} into 2)} -#' \item{\code{"else"}}{all other values, which have not been specified yet, are indicated by \emph{else}, e.g. \code{"3=1; 1=2; else=3"} (recodes 3 into 1, 1 into 2 and all other values into 3)} -#' \item{\code{"copy"}}{the \code{"else"}-token can be combined with \emph{copy}, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. \code{"3=1; 1=2; else=copy"} (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied, see 'Examples')} -#' \item{\code{NA}'s}{\code{\link{NA}} values are allowed both as old and new value, e.g. \code{"NA=1; 3:5=NA"} (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)} -#' \item{\code{"rev"}}{\code{"rev"} is a special token that reverses the value order (see 'Examples')} -#' \item{direct value labelling}{value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. \code{"15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"}. See 'Examples'.} -#' } -#' -#' @note Please note following behaviours of the function: -#' \itemize{ -#' \item the \code{"else"}-token should always be the last argument in the \code{rec}-string. -#' \item Non-matching values will be set to \code{NA}, unless captured by the \code{"else"}-token. -#' \item Tagged NA values (see \code{\link[haven]{tagged_na}}) and their value labels will be preserved when copying NA values to the recoded vector with \code{"else=copy"}. 
-#' \item Variable label attributes (see, for instance, \code{\link{get_label}}) are preserved (unless changed via \code{var.label}-argument), however, value label attributes are removed (except for \code{"rev"}, where present value labels will be automatically reversed as well). Use \code{val.labels}-argument to add labels for recoded values. -#' \item If \code{x} is a data frame or list-object, all variables should have the same categories resp. value range (else, see second bullet, \code{NA}s are produced). -#' } -#' -#' @examples -#' data(efc) -#' table(efc$e42dep, useNA = "always") -#' -#' # replace NA with 5 -#' table(rec(efc$e42dep, rec = "1=1;2=2;3=3;4=4;NA=5"), useNA = "always") -#' -#' # recode 1 to 2 into 1 and 3 to 4 into 2 -#' table(rec(efc$e42dep, rec = "1,2=1; 3,4=2"), useNA = "always") -#' -#' # or: -#' # rec(efc$e42dep) <- "1,2=1; 3,4=2" -#' # table(efc$e42dep, useNA = "always") -#' -#' # keep value labels. variable label is automatically preserved -#' library(dplyr) -#' efc %>% -#' select(e42dep) %>% -#' rec(rec = "1,2=1; 3,4=2", -#' val.labels = c("low dependency", "high dependency")) %>% -#' str() -#' -#' # works with mutate -#' efc %>% -#' select(e42dep, e17age) %>% -#' mutate(dependency_rev = rec(e42dep, rec = "rev")) %>% -#' head() -#' -#' # recode 1 to 3 into 4 into 2 -#' table(rec(efc$e42dep, rec = "min:3=1; 4=2"), useNA = "always") -#' -#' # recode 2 to 1 and all others into 2 -#' table(rec(efc$e42dep, rec = "2=1; else=2"), useNA = "always") -#' -#' # reverse value order -#' table(rec(efc$e42dep, rec = "rev"), useNA = "always") -#' -#' # recode only selected values, copy remaining -#' table(efc$e15relat) -#' table(rec(efc$e15relat, rec = "1,2,4=1; else=copy")) -#' -#' # recode variables with same category in a data frame -#' head(efc[, 6:9]) -#' head(rec(efc[, 6:9], rec = "1=10;2=20;3=30;4=40")) -#' -#' # recode multiple variables and set value labels via recode-syntax -#' dummy <- rec(efc, c160age, e17age, -#' rec = "15:30=1 [young]; 
31:55=2 [middle]; 56:max=3 [old]") -#' frq(dummy) -#' -#' # recode variables with same value-range -#' lapply( -#' rec(efc, c82cop1, c83cop2, c84cop3, rec = "1,2=1; NA=9; else=copy"), -#' table, -#' useNA = "always" -#' ) -#' -#' # recode character vector -#' dummy <- c("M", "F", "F", "X") -#' rec(dummy, rec = "M=Male; F=Female; X=Refused") -#' -#' # recode non-numeric factors -#' data(iris) -#' table(rec(iris, Species, rec = "setosa=huhu; else=copy")) -#' -#' # preserve tagged NAs -#' library(haven) -#' x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1), -#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), -#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) -#' # get current value labels -#' x -#' # recode 2 into 5; Values of tagged NAs are preserved -#' rec(x, rec = "2=5;else=copy") -#' na_tag(rec(x, rec = "2=5;else=copy")) -#' -#' # use select-helpers from dplyr-package -#' rec(efc, ~contains("cop"), c161sex:c175empl, rec = "0,1=0; else=1") -#' -#' -#' @export -rec <- function(x, ..., rec, as.num = TRUE, var.label = NULL, val.labels = NULL, append = FALSE, suffix = "_r", recodes) { - - # check deprecated arguments - if (!missing(recodes)) { - message("Argument `recodes` is deprecated. Please use `rec` instead.") - rec <- recodes - } - - # evaluate arguments, generate data - .dots <- match.call(expand.dots = FALSE)$`...` - .dat <- get_dot_data(x, .dots) - - if (is.data.frame(x)) { - - # remember original data, if user wants to bind columns - orix <- tibble::as_tibble(x) - - # iterate variables of data frame - for (i in colnames(.dat)) { - x[[i]] <- rec_helper( - x = .dat[[i]], - recodes = rec, - as.num = as.num, - var.label = var.label, - val.labels = val.labels - ) - } - - # coerce to tibble and select only recoded variables - x <- tibble::as_tibble(x[colnames(.dat)]) - - # add suffix to recoded variables? 
- if (!is.null(suffix) && !sjmisc::is_empty(suffix)) { - colnames(x) <- sprintf("%s%s", colnames(x), suffix) - } - - # combine data - if (append) x <- dplyr::bind_cols(orix, x) - } else { - x <- rec_helper( - x = .dat, - recodes = rec, - as.num = as.num, - var.label = var.label, - val.labels = val.labels - ) - } - - x -} - - -#' @importFrom stats na.omit -rec_helper <- function(x, recodes, as.num, var.label, val.labels) { - # retrieve variable label - if (is.null(var.label)) - var_lab <- get_label(x) - else - var_lab <- var.label - # do we have any value labels? - val_lab <- val.labels - # remember if NA's have been recoded... - na_recoded <- FALSE - - # drop labels when reversing - if (recodes == "rev") x <- drop_labels(x, drop.na = TRUE) - - # get current NA values - current.na <- get_na(x) - - # do we have a factor with "x"? - if (is.factor(x)) { - # save variable labels before in case we just want - # to reverse the order - if (is.null(val_lab) && recodes == "rev") { - val_lab <- rev(get_labels(x, attr.only = TRUE, include.values = NULL, - include.non.labelled = TRUE, drop.na = TRUE)) - } - - if (is_num_fac(x)) { - # numeric factors coerced to numeric - x <- as.numeric(as.character(x)) - } else { - # non-numeric factors coerced to character - x <- as.character(x) - # non-numeric factors will always be factor again - as.num <- FALSE - } - } - - # retrieve min and max values - min_val <- min(x, na.rm = T) - max_val <- max(x, na.rm = T) - - # do we have special recode-token? 
- if (recodes == "rev") { - # retrieve unique valus, sorted - ov <- sort(unique(stats::na.omit(as.vector(x)))) - # new values should be reversed order - nv <- rev(ov) - # create recodes-string - recodes <- paste(sprintf("%i=%i", ov, nv), collapse = ";") - # when we simply reverse values, we can keep value labels - if (is.null(val_lab)) { - val_lab <- rev(get_labels(x, attr.only = TRUE, include.values = NULL, - include.non.labelled = TRUE, drop.na = TRUE)) - } - } - - # we allow direct labelling, so extract possible direct labels here - # this piece of code is definitely not the best solution, I bet... - # but it seems to work, and I discovered the regex-pattern by myself :-) - # this function extracts direct value labels from the recodes-pattern and - # creates a named vector with value labels, e.g.: - # "18:23=1 [18to23]; 24:65=2 [24to65]; 66:max=3 [> 65]" - # will return a named vector with value 1 to 3, where the text inside [ and ] - # is used as name for each value - dir.label <- unlist(lapply(strsplit( - unlist(regmatches( - recodes, - gregexpr( - pattern = "=([^\\]]*)\\]", - text = recodes, - perl = T - ) - )), - split = "\\[", perl = T - ), - function(x) { - tmp <- as.numeric(trim(substr(x[1], 2, nchar(x[1])))) - names(tmp) <- trim(substr(x[2], 1, nchar(x[2]) - 1)) - tmp - })) - - # if we found any labels, replace the value label argument - if (!is.null(dir.label) && !sjmisc::is_empty(dir.label)) val_lab <- dir.label - - # remove possible direct labels from recode pattern - recodes <- gsub(pattern = "\\[([^\\[]*)\\]", replacement = "", x = recodes, perl = T) - - # prepare and clean recode string - # retrieve each single recode command - rec_string <- unlist(strsplit(recodes, ";", fixed = TRUE)) - # remove spaces - rec_string <- gsub(" ", "", rec_string, fixed = TRUE) - # remove line breaks - rec_string <- gsub("\n", "", rec_string, fixed = F) - rec_string <- gsub("\r", "", rec_string, fixed = F) - # replace min and max placeholders - rec_string <- 
gsub("min", as.character(min_val), rec_string, fixed = TRUE) - rec_string <- gsub("lo", as.character(min_val), rec_string, fixed = TRUE) - rec_string <- gsub("max", as.character(max_val), rec_string, fixed = TRUE) - rec_string <- gsub("hi", as.character(max_val), rec_string, fixed = TRUE) - # retrieve all recode-pairs, i.e. all old-value = new-value assignments - rec_pairs <- strsplit(rec_string, "=", fixed = TRUE) - - # check for correct syntax - correct_syntax <- unlist(lapply(rec_pairs, function(r) if (length(r) != 2) r else NULL)) - # found any errors in syntax? - if (!is.null(correct_syntax)) { - stop(sprintf("?Syntax error in argument \"%s\"", paste(correct_syntax, collapse = "=")), call. = F) - } - - # the new, recoded variable - new_var <- rep(-Inf, length(x)) - - # now iterate all recode pairs - # and do each recoding step - for (i in seq_len(length(rec_pairs))) { - # retrieve recode pairs as string, and start with separaring old-values - # at comma separator - old_val_string <- unlist(strsplit(rec_pairs[[i]][1], ",", fixed = TRUE)) - new_val_string <- rec_pairs[[i]][2] - new_val <- c() - - # check if new_val_string is correct syntax - if (new_val_string == "NA") { - # here we have a valid NA specification - new_val <- NA - } else if (new_val_string == "copy") { - # copy all remaining values, i.e. don't recode - # remaining values that have not else been specified - # or recoded. NULL indicates the "copy"-token - new_val <- NULL - } else { - # can new value be converted to numeric? 
- new_val <- suppressWarnings(as.numeric(new_val_string)) - # if not, assignment is wrong - if (is.na(new_val)) new_val <- new_val_string - } - - # retrieve and check old values - old_val <- c() - for (j in seq_len(length(old_val_string))) { - # copy to shorten code - ovs <- old_val_string[j] - - # check if old_val_string is correct syntax - if (ovs == "NA") { - # here we have a valid NA specification - # add value to vector of old values that - # should be recoded - old_val <- c(old_val, NA) - } else if (ovs == "else") { - # here we have a valid "else" specification - # add all remaining values (in the new variable - # created as "-Inf") to vector that should be recoded - old_val <- -Inf - break - } else if (length(grep(":", ovs, fixed = TRUE)) > 0) { - # this value indicates a range of values to be recoded, because - # we have found a colon. now copy from and to values from range - from <- suppressWarnings(as.numeric(unlist(strsplit(ovs, ":", fixed = T))[1])) - to <- suppressWarnings(as.numeric(unlist(strsplit(ovs, ":", fixed = T))[2])) - # check for valid range values - if (is.na(from) || is.na(to)) { - stop(sprintf("?Syntax error in argument \"%s\"", ovs), call. = F) - } - # add range to vector of old values - old_val <- c(old_val, seq(from, to)) - } else { - # can new value be converted to numeric? - ovn <- suppressWarnings(as.numeric(ovs)) - # if not, assignment is wrong - if (is.na(ovn)) ovn <- ovs - # add old recode values to final vector of values - old_val <- c(old_val, ovn) - } - } - - # now we have all recode values and want - # to replace old with new values... - for (k in seq_len(length(old_val))) { - # check for "else" token - if (is.infinite(old_val[k])) { - # else-token found. we first need to preserve NA, but only, - # if these haven't been copied before - if (!na_recoded) new_var[which(is.na(x))] <- x[which(is.na(x))] - # find replace-indices. 
since "else"-token has to be - # the last argument in the "recodes"-string, the remaining, - # non-recoded values are still "-Inf". Hence, find positions - # of all not yet recoded values - rep.pos <- which(new_var == -Inf) - # else token found, now check whether we have a "copy" - # token as well. in this case, new_val would be NULL - if (is.null(new_val)) { - # all not yet recodes values in new_var should get - # the values at that position of "x" (the old variable), - # i.e. these values remain unchanged. - new_var[rep.pos] <- x[rep.pos] - } else { - # find all -Inf in new var and replace them with replace value - new_var[rep.pos] <- new_val - } - # check for "NA" token - } else if (is.na(old_val[k])) { - # replace all NA with new value - new_var[which(is.na(x))] <- new_val - # remember that we have recoded NA's. Might be - # important for else-token above. - na_recoded <- TRUE - } else { - # else we have numeric values, which should be replaced - new_var[which(x == old_val[k])] <- new_val - } - } - } - # replace remaining -Inf with NA - if (any(is.infinite(new_var))) new_var[which(new_var == -Inf)] <- NA - # add back NA labels - if (!is.null(current.na) && length(current.na) > 0) { - # add named missings - val_lab <- c(val_lab, current.na) - } - # set back variable and value labels - new_var <- suppressWarnings(set_label(x = new_var, label = var_lab)) - new_var <- suppressWarnings(set_labels(x = new_var, labels = val_lab)) - # return result as factor? - if (!as.num) new_var <- to_factor(new_var) - return(new_var) -} +#' @title Recode variables +#' @name rec +#' +#' @description Recodes values of variables +#' +#' @seealso \code{\link{set_na}} for setting \code{NA} values, \code{\link{replace_na}} +#' to replace \code{NA}'s with specific value, \code{\link{recode_to}} +#' for re-shifting value ranges and \code{\link{ref_lvl}} to change the +#' reference level of (numeric) factors. +#' +#' @param rec String with recode pairs of old and new values. 
See +#' 'Details' for examples. \code{\link{rec_pattern}} is a convenient +#' function to create recode strings for grouping variables. +#' @param as.num Logical, if \code{TRUE}, return value will be numeric, not a factor. +#' @param var.label Optional string, to set variable label attribute for the +#' returned variable (see vignette \href{../doc/intro_sjmisc.html}{Labelled Data and the sjmisc-Package}). +#' If \code{NULL} (default), variable label attribute of \code{x} will +#' be used (if present). If empty, variable label attributes will be removed. +#' @param val.labels Optional character vector, to set value label attributes +#' of recoded variable (see vignette \href{../doc/intro_sjmisc.html}{Labelled Data and the sjmisc-Package}). +#' If \code{NULL} (default), no value labels will be set. Value labels +#' can also be directly defined in the \code{rec}-syntax, see +#' 'Details'. +#' @param append Logical, if \code{TRUE} and \code{x} is a data frame, +#' \code{x} including the new variables as additional columns is returned; +#' if \code{FALSE} (the default), only the new variables are returned. +#' @param suffix String value, will be appended to variable (column) names of +#' \code{x}, if \code{x} is a data frame. If \code{x} is not a data +#' frame, this argument will be ignored. 
The default value to suffix +#' column names in a data frame depends on the function call: +#' \itemize{ +#' \item recoded variables (\code{rec()}) will be suffixed with \code{"_r"} +#' \item recoded variables (\code{recode_to()}) will be suffixed with \code{"_r0"} +#' \item dichotomized variables (\code{dicho()}) will be suffixed with \code{"_d"} +#' \item grouped variables (\code{split_var()}) will be suffixed with \code{"_g"} +#' \item grouped variables (\code{group_var()}) will be suffixed with \code{"_gr"} +#' \item standardized variables (\code{std()}) will be suffixed with \code{"_z"} +#' \item centered variables (\code{center()}) will be suffixed with \code{"_c"} +#' } +#' @param recodes Deprecated. Use \code{rec} instead. +#' +#' @inheritParams to_factor +#' +#' @return \code{x} with recoded categories. If \code{x} is a data frame, +#' for \code{append = TRUE}, \code{x} including the recoded variables +#' as new columns is returned; if \code{append = FALSE}, only +#' the recoded variables will be returned. +#' +#' @details The \code{rec} string has the following syntax: +#' \describe{ +#' \item{recode pairs}{each recode pair has to be separated by a \code{;}, e.g. \code{rec = "1=1; 2=4; 3=2; 4=3"}} +#' \item{multiple values}{multiple old values that should be recoded into a new single value may be separated with comma, e.g. \code{"1,2=1; 3,4=2"}} +#' \item{value range}{a value range is indicated by a colon, e.g. \code{"1:4=1; 5:8=2"} (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)} +#' \item{\code{"min"} and \code{"max"}}{minimum and maximum values are indicated by \emph{min} (or \emph{lo}) and \emph{max} (or \emph{hi}), e.g. \code{"min:4=1; 5:max=2"} (recodes all values from the minimum value of \code{x} to 4 into 1, and from 5 to the maximum value of \code{x} into 2)} +#' \item{\code{"else"}}{all other values, which have not been specified yet, are indicated by \emph{else}, e.g.
\code{"3=1; 1=2; else=3"} (recodes 3 into 1, 1 into 2 and all other values into 3)} +#' \item{\code{"copy"}}{the \code{"else"}-token can be combined with \emph{copy}, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. \code{"3=1; 1=2; else=copy"} (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied, see 'Examples')} +#' \item{\code{NA}'s}{\code{\link{NA}} values are allowed both as old and new value, e.g. \code{"NA=1; 3:5=NA"} (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)} +#' \item{\code{"rev"}}{\code{"rev"} is a special token that reverses the value order (see 'Examples')} +#' \item{direct value labelling}{value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. \code{"15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"}. See 'Examples'.} +#' } +#' +#' @note Please note following behaviours of the function: +#' \itemize{ +#' \item the \code{"else"}-token should always be the last argument in the \code{rec}-string. +#' \item Non-matching values will be set to \code{NA}, unless captured by the \code{"else"}-token. +#' \item Tagged NA values (see \code{\link[haven]{tagged_na}}) and their value labels will be preserved when copying NA values to the recoded vector with \code{"else=copy"}. +#' \item Variable label attributes (see, for instance, \code{\link{get_label}}) are preserved (unless changed via \code{var.label}-argument), however, value label attributes are removed (except for \code{"rev"}, where present value labels will be automatically reversed as well). Use \code{val.labels}-argument to add labels for recoded values. +#' \item If \code{x} is a data frame, all variables should have the same categories resp. value range (else, see second bullet, \code{NA}s are produced). 
+#' } +#' +#' @examples +#' data(efc) +#' table(efc$e42dep, useNA = "always") +#' +#' # replace NA with 5 +#' table(rec(efc$e42dep, rec = "1=1;2=2;3=3;4=4;NA=5"), useNA = "always") +#' +#' # recode 1 to 2 into 1 and 3 to 4 into 2 +#' table(rec(efc$e42dep, rec = "1,2=1; 3,4=2"), useNA = "always") +#' +#' # or: +#' # rec(efc$e42dep) <- "1,2=1; 3,4=2" +#' # table(efc$e42dep, useNA = "always") +#' +#' # add value labels. variable label is automatically preserved +#' library(dplyr) +#' efc %>% +#' select(e42dep) %>% +#' rec(rec = "1,2=1; 3,4=2", +#' val.labels = c("low dependency", "high dependency")) %>% +#' str() +#' +#' # works with mutate +#' efc %>% +#' select(e42dep, e17age) %>% +#' mutate(dependency_rev = rec(e42dep, rec = "rev")) %>% +#' head() +#' +#' # recode 1 to 3 into 1 and 4 into 2 +#' table(rec(efc$e42dep, rec = "min:3=1; 4=2"), useNA = "always") +#' +#' # recode 2 to 1 and all others into 2 +#' table(rec(efc$e42dep, rec = "2=1; else=2"), useNA = "always") +#' +#' # reverse value order +#' table(rec(efc$e42dep, rec = "rev"), useNA = "always") +#' +#' # recode only selected values, copy remaining +#' table(efc$e15relat) +#' table(rec(efc$e15relat, rec = "1,2,4=1; else=copy")) +#' +#' # recode variables with same category in a data frame +#' head(efc[, 6:9]) +#' head(rec(efc[, 6:9], rec = "1=10;2=20;3=30;4=40")) +#' +#' # recode multiple variables and set value labels via recode-syntax +#' dummy <- rec(efc, c160age, e17age, +#' rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]") +#' frq(dummy) +#' +#' # recode variables with same value-range +#' lapply( +#' rec(efc, c82cop1, c83cop2, c84cop3, rec = "1,2=1; NA=9; else=copy"), +#' table, +#' useNA = "always" +#' ) +#' +#' # recode character vector +#' dummy <- c("M", "F", "F", "X") +#' rec(dummy, rec = "M=Male; F=Female; X=Refused") +#' +#' # recode non-numeric factors +#' data(iris) +#' table(rec(iris, Species, rec = "setosa=huhu; else=copy")) +#' +#' # preserve tagged NAs +#' library(haven) +#' x <-
labelled(c(1:3, tagged_na("a", "c", "z"), 4:1), +#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), +#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) +#' # get current value labels +#' x +#' # recode 2 into 5; Values of tagged NAs are preserved +#' rec(x, rec = "2=5;else=copy") +#' na_tag(rec(x, rec = "2=5;else=copy")) +#' +#' # use select-helpers from dplyr-package +#' rec(efc, ~contains("cop"), c161sex:c175empl, rec = "0,1=0; else=1") +#' +#' +#' @export +rec <- function(x, ..., rec, as.num = TRUE, var.label = NULL, val.labels = NULL, append = FALSE, suffix = "_r", recodes) { + + # check deprecated arguments + if (!missing(recodes)) { + message("Argument `recodes` is deprecated. Please use `rec` instead.") + rec <- recodes + } + + # evaluate arguments, generate data + .dots <- match.call(expand.dots = FALSE)$`...` + .dat <- get_dot_data(x, .dots) + + if (is.data.frame(x)) { + + # remember original data, if user wants to bind columns + orix <- tibble::as_tibble(x) + + # iterate variables of data frame + for (i in colnames(.dat)) { + x[[i]] <- rec_helper( + x = .dat[[i]], + recodes = rec, + as.num = as.num, + var.label = var.label, + val.labels = val.labels + ) + } + + # coerce to tibble and select only recoded variables + x <- tibble::as_tibble(x[colnames(.dat)]) + + # add suffix to recoded variables? + if (!is.null(suffix) && !sjmisc::is_empty(suffix)) { + colnames(x) <- sprintf("%s%s", colnames(x), suffix) + } + + # combine data + if (append) x <- dplyr::bind_cols(orix, x) + } else { + x <- rec_helper( + x = .dat, + recodes = rec, + as.num = as.num, + var.label = var.label, + val.labels = val.labels + ) + } + + x +} + + +#' @importFrom stats na.omit +rec_helper <- function(x, recodes, as.num, var.label, val.labels) { + # retrieve variable label + if (is.null(var.label)) + var_lab <- get_label(x) + else + var_lab <- var.label + # do we have any value labels? + val_lab <- val.labels + # remember if NA's have been recoded... 
+ na_recoded <- FALSE + + # drop labels when reversing + if (recodes == "rev") x <- drop_labels(x, drop.na = TRUE) + + # get current NA values + current.na <- get_na(x) + + # do we have a factor with "x"? + if (is.factor(x)) { + # save value labels beforehand, in case we just want + # to reverse the order + if (is.null(val_lab) && recodes == "rev") { + val_lab <- rev(get_labels(x, attr.only = TRUE, include.values = NULL, + include.non.labelled = TRUE, drop.na = TRUE)) + } + + if (is_num_fac(x)) { + # numeric factors coerced to numeric + x <- as.numeric(as.character(x)) + } else { + # non-numeric factors coerced to character + x <- as.character(x) + # non-numeric factors will always be factor again + as.num <- FALSE + } + } + + # retrieve min and max values + min_val <- min(x, na.rm = T) + max_val <- max(x, na.rm = T) + + # do we have special recode-token? + if (recodes == "rev") { + # retrieve unique values, sorted + ov <- sort(unique(stats::na.omit(as.vector(x)))) + # new values should be reversed order + nv <- rev(ov) + # create recodes-string + recodes <- paste(sprintf("%i=%i", ov, nv), collapse = ";") + # when we simply reverse values, we can keep value labels + if (is.null(val_lab)) { + val_lab <- rev(get_labels(x, attr.only = TRUE, include.values = NULL, + include.non.labelled = TRUE, drop.na = TRUE)) + } + } + + # we allow direct labelling, so extract possible direct labels here + # this piece of code is definitely not the best solution, I bet...
+ # but it seems to work, and I discovered the regex-pattern by myself :-) + # this function extracts direct value labels from the recodes-pattern and + # creates a named vector with value labels, e.g.: + # "18:23=1 [18to23]; 24:65=2 [24to65]; 66:max=3 [> 65]" + # will return a named vector with value 1 to 3, where the text inside [ and ] + # is used as name for each value + dir.label <- unlist(lapply(strsplit( + unlist(regmatches( + recodes, + gregexpr( + pattern = "=([^\\]]*)\\]", + text = recodes, + perl = T + ) + )), + split = "\\[", perl = T + ), + function(x) { + tmp <- as.numeric(trim(substr(x[1], 2, nchar(x[1])))) + names(tmp) <- trim(substr(x[2], 1, nchar(x[2]) - 1)) + tmp + })) + + # if we found any labels, replace the value label argument + if (!is.null(dir.label) && !sjmisc::is_empty(dir.label)) val_lab <- dir.label + + # remove possible direct labels from recode pattern + recodes <- gsub(pattern = "\\[([^\\[]*)\\]", replacement = "", x = recodes, perl = T) + + # prepare and clean recode string + # retrieve each single recode command + rec_string <- unlist(strsplit(recodes, ";", fixed = TRUE)) + # remove spaces + rec_string <- gsub(" ", "", rec_string, fixed = TRUE) + # remove line breaks + rec_string <- gsub("\n", "", rec_string, fixed = F) + rec_string <- gsub("\r", "", rec_string, fixed = F) + # replace min and max placeholders + rec_string <- gsub("min", as.character(min_val), rec_string, fixed = TRUE) + rec_string <- gsub("lo", as.character(min_val), rec_string, fixed = TRUE) + rec_string <- gsub("max", as.character(max_val), rec_string, fixed = TRUE) + rec_string <- gsub("hi", as.character(max_val), rec_string, fixed = TRUE) + # retrieve all recode-pairs, i.e. all old-value = new-value assignments + rec_pairs <- strsplit(rec_string, "=", fixed = TRUE) + + # check for correct syntax + correct_syntax <- unlist(lapply(rec_pairs, function(r) if (length(r) != 2) r else NULL)) + # found any errors in syntax? 
+ if (!is.null(correct_syntax)) { + stop(sprintf("?Syntax error in argument \"%s\"", paste(correct_syntax, collapse = "=")), call. = FALSE) + } + + # the new, recoded variable + new_var <- rep(-Inf, length(x)) + + # now iterate all recode pairs + # and do each recoding step + for (i in seq_len(length(rec_pairs))) { + # retrieve recode pairs as string, and start with separating old-values + # at the comma separator + old_val_string <- unlist(strsplit(rec_pairs[[i]][1], ",", fixed = TRUE)) + new_val_string <- rec_pairs[[i]][2] + new_val <- c() + + # check if new_val_string has correct syntax + if (new_val_string == "NA") { + # here we have a valid NA specification + new_val <- NA + } else if (new_val_string == "copy") { + # copy all remaining values, i.e. don't recode + # remaining values that have not been specified + # or recoded otherwise. NULL indicates the "copy"-token + new_val <- NULL + } else { + # can new value be converted to numeric? + new_val <- suppressWarnings(as.numeric(new_val_string)) + # if not, keep the new value as character string + if (is.na(new_val)) new_val <- new_val_string + } + + # retrieve and check old values + old_val <- c() + for (j in seq_len(length(old_val_string))) { + # copy to shorten code + ovs <- old_val_string[j] + + # check if old_val_string has correct syntax + if (ovs == "NA") { + # here we have a valid NA specification + # add value to vector of old values that + # should be recoded + old_val <- c(old_val, NA) + } else if (ovs == "else") { + # here we have a valid "else" specification + # add all remaining values (in the new variable + # created as "-Inf") to vector that should be recoded + old_val <- -Inf + break + } else if (length(grep(":", ovs, fixed = TRUE)) > 0) { + # this value indicates a range of values to be recoded, because + # we have found a colon.
now copy from and to values from range + from <- suppressWarnings(as.numeric(unlist(strsplit(ovs, ":", fixed = T))[1])) + to <- suppressWarnings(as.numeric(unlist(strsplit(ovs, ":", fixed = T))[2])) + # check for valid range values + if (is.na(from) || is.na(to)) { + stop(sprintf("?Syntax error in argument \"%s\"", ovs), call. = F) + } + # add range to vector of old values + old_val <- c(old_val, seq(from, to)) + } else { + # can new value be converted to numeric? + ovn <- suppressWarnings(as.numeric(ovs)) + # if not, assignment is wrong + if (is.na(ovn)) ovn <- ovs + # add old recode values to final vector of values + old_val <- c(old_val, ovn) + } + } + + # now we have all recode values and want + # to replace old with new values... + for (k in seq_len(length(old_val))) { + # check for "else" token + if (is.infinite(old_val[k])) { + # else-token found. we first need to preserve NA, but only, + # if these haven't been copied before + if (!na_recoded) new_var[which(is.na(x))] <- x[which(is.na(x))] + # find replace-indices. since "else"-token has to be + # the last argument in the "recodes"-string, the remaining, + # non-recoded values are still "-Inf". Hence, find positions + # of all not yet recoded values + rep.pos <- which(new_var == -Inf) + # else token found, now check whether we have a "copy" + # token as well. in this case, new_val would be NULL + if (is.null(new_val)) { + # all not yet recodes values in new_var should get + # the values at that position of "x" (the old variable), + # i.e. these values remain unchanged. + new_var[rep.pos] <- x[rep.pos] + } else { + # find all -Inf in new var and replace them with replace value + new_var[rep.pos] <- new_val + } + # check for "NA" token + } else if (is.na(old_val[k])) { + # replace all NA with new value + new_var[which(is.na(x))] <- new_val + # remember that we have recoded NA's. Might be + # important for else-token above. 
+ na_recoded <- TRUE + } else { + # else we have numeric values, which should be replaced + new_var[which(x == old_val[k])] <- new_val + } + } + } + # replace remaining -Inf with NA + if (any(is.infinite(new_var))) new_var[which(new_var == -Inf)] <- NA + # add back NA labels + if (!is.null(current.na) && length(current.na) > 0) { + # add named missings + val_lab <- c(val_lab, current.na) + } + # set back variable and value labels + new_var <- suppressWarnings(set_label(x = new_var, label = var_lab)) + new_var <- suppressWarnings(set_labels(x = new_var, labels = val_lab)) + # return result as factor? + if (!as.num) new_var <- to_factor(new_var) + return(new_var) +} diff --git a/R/row_sums.R b/R/row_sums.R index 9ae20ce1..59c84bb0 100644 --- a/R/row_sums.R +++ b/R/row_sums.R @@ -1,130 +1,132 @@ -#' @title Row sums and means for data frames -#' @name row_sums -#' -#' @description \code{row_sums()} simply wraps \code{\link{rowSums}}, while -#' \code{row_means()} simply wraps \code{\link[sjstats]{mean_n}}, -#' however, the argument-structure of both functions is designed -#' to work nicely within a pipe-workflow and allows select-helpers -#' for selecting variables, the default for \code{na.rm} is \code{TRUE}, -#' and the return value is always a tibble (with one variable). -#' -#' @param n May either be -#' \itemize{ -#' \item a numeric value that indicates the amount of valid values per row to calculate the row mean; -#' \item or a value between 0 and 1, indicating a proportion of valid values per row to calculate the row mean (see 'Details'). -#' } -#' If a row's sum of valid values is less than \code{n}, \code{NA} will be returned as row mean value. -#' @param na.rm Logical, \code{TRUE} if missing values should be omitted from -#' the calculations. -#' @param var Name of new the variable with the row sums or means. 
-#' @inheritParams to_factor -#' -#' @return For \code{row_sums()}, a tibble with one variable: the row sums from -#' \code{x}; for \code{row_means()}, a tibble with one variable: row -#' means from \code{x}. -#' -#' @details For \code{n}, must be a numeric value from \code{0} to \code{ncol(x)}. If -#' a \emph{row} in \code{x} has at least \code{n} non-missing values, the -#' row mean is returned. If \code{n} is a non-integer value from 0 to 1, -#' \code{n} is considered to indicate the proportion of necessary non-missing -#' values per row. E.g., if \code{n = .75}, a row must have at least \code{ncol(x) * n} -#' non-missing values for the row mean to be calculated. See 'Examples'. -#' -#' @examples -#' data(efc) -#' efc %>% row_sums(c82cop1:c90cop9) -#' -#' library(dplyr) -#' row_sums(efc, ~contains("cop")) -#' -#' dat <- data.frame( -#' c1 = c(1,2,NA,4), -#' c2 = c(NA,2,NA,5), -#' c3 = c(NA,4,NA,NA), -#' c4 = c(2,3,7,8), -#' c5 = c(1,7,5,3) -#' ) -#' dat -#' -#' row_means(dat, n = 4) -#' row_means(dat, c1:c4, n = 4) -#' # at least 40% non-missing -#' row_means(dat, c1:c4, n = .4) -#' -#' # create sum-score of COPE-Index, and append to data -#' efc %>% -#' select(c82cop1:c90cop9) %>% -#' row_sums() %>% -#' add_columns(efc) -#' -#' @importFrom tibble as_tibble -#' @export -row_sums <- function(x, ..., na.rm = TRUE, var = "rowsums", append = FALSE) { - # evaluate arguments, generate data - .dots <- match.call(expand.dots = FALSE)$`...` - .dat <- get_dot_data(x, .dots) - - - # remember original data, if user wants to bind columns - orix <- tibble::as_tibble(x) - - if (is.data.frame(x)) { - rs <- rowSums(.dat, na.rm = na.rm) - } else { - stop("`x` must be a data frame.", call. 
= F) - } - - - # to tibble, and rename variable - rs <- tibble::as_tibble(rs) - colnames(rs) <- var - - # combine data - if (append) rs <- dplyr::bind_cols(orix, rs) - - rs -} - - -#' @rdname row_sums -#' @export -row_means <- function(x, ..., n, var = "rowmeans", append = FALSE) { - # evaluate arguments, generate data - .dots <- match.call(expand.dots = FALSE)$`...` - .dat <- get_dot_data(x, .dots) - - # remember original data, if user wants to bind columns - orix <- tibble::as_tibble(x) - - if (is.data.frame(x)) { - # is 'n' indicating a proportion? - digs <- n %% 1 - if (digs != 0) n <- round(ncol(.dat) * digs) - - # check if we have a data framme with at least two columns - if (ncol(.dat) < 2) { - warning("`x` must be a data frame with at least two columns.", call. = TRUE) - return(NA) - } - - # n may not be larger as df's amount of columns - if (ncol(.dat) < n) { - warning("`n` must be smaller or equal to number of columns in data frame.", call. = TRUE) - return(NA) - } - - rm <- apply(.dat, 1, function(x) ifelse(sum(!is.na(x)) >= n, mean(x, na.rm = TRUE), NA)) - - } else { - stop("`x` must be a data frame.", call. = F) - } - - # to tibble, and rename variable - rm <- tibble::as_tibble(rm) - colnames(rm) <- var - - # combine data - if (append) rm <- dplyr::bind_cols(orix, rm) - - rm -} +#' @title Row sums and means for data frames +#' @name row_sums +#' +#' @description \code{row_sums()} simply wraps \code{\link{rowSums}}, while +#' \code{row_means()} simply wraps \code{\link[sjstats]{mean_n}}, +#' however, the argument-structure of both functions is designed +#' to work nicely within a pipe-workflow and allows select-helpers +#' for selecting variables, the default for \code{na.rm} is \code{TRUE}, +#' and the return value is always a tibble (with one variable). 
+#' +#' @param n May either be +#' \itemize{ +#' \item a numeric value that indicates the number of valid values per row required to calculate the row mean; +#' \item or a value between 0 and 1, indicating the proportion of valid values per row required to calculate the row mean (see 'Details'). +#' } +#' If a row's number of valid values is less than \code{n}, \code{NA} will be returned as row mean value. +#' @param na.rm Logical, \code{TRUE} if missing values should be omitted from +#' the calculations. +#' @param var Name of the new variable with the row sums or means. +#' +#' @inheritParams to_factor +#' @inheritParams rec +#' +#' @return For \code{row_sums()}, a tibble with one variable: the row sums from +#' \code{x}; for \code{row_means()}, a tibble with one variable: the row +#' means from \code{x}. +#' +#' @details \code{n} must be a numeric value from \code{0} to \code{ncol(x)}. If +#' a \emph{row} in \code{x} has at least \code{n} non-missing values, the +#' row mean is returned. If \code{n} is a non-integer value from 0 to 1, +#' \code{n} is considered to indicate the proportion of necessary non-missing +#' values per row. E.g., if \code{n = .75}, a row must have at least \code{ncol(x) * n} +#' non-missing values for the row mean to be calculated. See 'Examples'.
+#' +#' @examples +#' data(efc) +#' efc %>% row_sums(c82cop1:c90cop9) +#' +#' library(dplyr) +#' row_sums(efc, ~contains("cop")) +#' +#' dat <- data.frame( +#' c1 = c(1,2,NA,4), +#' c2 = c(NA,2,NA,5), +#' c3 = c(NA,4,NA,NA), +#' c4 = c(2,3,7,8), +#' c5 = c(1,7,5,3) +#' ) +#' dat +#' +#' row_means(dat, n = 4) +#' row_means(dat, c1:c4, n = 4) +#' # at least 40% non-missing +#' row_means(dat, c1:c4, n = .4) +#' +#' # create sum-score of COPE-Index, and append to data +#' efc %>% +#' select(c82cop1:c90cop9) %>% +#' row_sums() %>% +#' add_columns(efc) +#' +#' @importFrom tibble as_tibble +#' @export +row_sums <- function(x, ..., na.rm = TRUE, var = "rowsums", append = FALSE) { + # evaluate arguments, generate data + .dots <- match.call(expand.dots = FALSE)$`...` + .dat <- get_dot_data(x, .dots) + + + # remember original data, if user wants to bind columns + orix <- tibble::as_tibble(x) + + if (is.data.frame(x)) { + rs <- rowSums(.dat, na.rm = na.rm) + } else { + stop("`x` must be a data frame.", call. = FALSE) + } + + + # to tibble, and rename variable + rs <- tibble::as_tibble(rs) + colnames(rs) <- var + + # combine data + if (append) rs <- dplyr::bind_cols(orix, rs) + + rs +} + + +#' @rdname row_sums +#' @export +row_means <- function(x, ..., n, var = "rowmeans", append = FALSE) { + # evaluate arguments, generate data + .dots <- match.call(expand.dots = FALSE)$`...` + .dat <- get_dot_data(x, .dots) + + # remember original data, if user wants to bind columns + orix <- tibble::as_tibble(x) + + if (is.data.frame(x)) { + # is 'n' indicating a proportion? + digs <- n %% 1 + if (digs != 0) n <- round(ncol(.dat) * digs) + + # check if we have a data frame with at least two columns + if (ncol(.dat) < 2) { + warning("`x` must be a data frame with at least two columns.", call. = TRUE) + return(NA) + } + + # n may not be larger than the data frame's number of columns + if (ncol(.dat) < n) { + warning("`n` must be smaller or equal to number of columns in data frame.", call.
= TRUE) + return(NA) + } + + rm <- apply(.dat, 1, function(x) ifelse(sum(!is.na(x)) >= n, mean(x, na.rm = TRUE), NA)) + + } else { + stop("`x` must be a data frame.", call. = FALSE) + } + + # to tibble, and rename variable + rm <- tibble::as_tibble(rm) + colnames(rm) <- var + + # combine data + if (append) rm <- dplyr::bind_cols(orix, rm) + + rm +} diff --git a/R/spread_coef.R b/R/spread_coef.R index 9211cd7e..7ac86b89 100644 --- a/R/spread_coef.R +++ b/R/spread_coef.R @@ -120,10 +120,37 @@ spread_coef <- function(data, model.column, model.term, se, p.val, append = TRUE # check if user just wants a specific model term # if yes, select this, and its p-value if (!sjmisc::is_empty(model.term)) { + # validate model term, i.e. check if coefficient exists in models + tmp <- broom::tidy(data[[model.column]][[1]], effects = "fixed") + + # if term is not a valid coefficient name, tell user, and + # suggest terms that were possibly meant instead + if (model.term %nin% tmp$term) { + + pos <- str_pos(search.string = tmp$term, find.term = model.term, part.dist.match = 1) + + if (any(pos != -1)) { + pos_str <- + sprintf(" Did you mean (one of) `%s`?", paste(tmp$term[pos], collapse = ",")) + } else { + pos_str <- "" + } + + stop( + sprintf( + "`%s` is no valid model term.%s", + model.term, + pos_str + ), + call. = FALSE + ) + } + # select variables for output variables <- "estimate" if (se) variables <- c(variables, "std.error") if (p.val) variables <- c(variables, "p.value") + # iterate list variable dat <- purrr::map_df(data[[model.column]], function(x) { diff --git a/R/str_pos.R b/R/str_pos.R index 6fdbfa78..5eaa22fa 100644 --- a/R/str_pos.R +++ b/R/str_pos.R @@ -1,134 +1,136 @@ -#' @title Find partial matching and close distance elements in strings -#' @name str_pos -#' @description This function finds the element indices of partial matching or similar strings -#' in a character vector. Can be used to find exact or slightly mistyped elements -#' in a string vector.
-#' -#' @seealso \code{\link{group_str}} -#' -#' @param search.string Character vector with string elements. -#' @param find.term String that should be matched against the elements of \code{search.string}. -#' @param maxdist Maximum distance between two string elements, which is allowed to treat them -#' as similar or equal. Smaller values mean less tolerance in matching. -#' @param part.dist.match Activates similar matching (close distance strings) for parts (substrings) -#' of the \code{search.string}. Following values are accepted: -#' \itemize{ -#' \item 0 for no partial distance matching -#' \item 1 for one-step matching, which means, only substrings of same length as \code{find.term} are extracted from \code{search.string} matching -#' \item 2 for two-step matching, which means, substrings of same length as \code{find.term} as well as strings with a slightly wider range are extracted from \code{search.string} matching -#' } -#' Default value is 0. See 'Details' for more information. -#' @param show.pbar Logical; f \code{TRUE}, the progress bar is displayed when computing the distance matrix. -#' Default in \code{FALSE}, hence the bar is hidden. -#' -#' @return A numeric vector with index position of elements in \code{search.string} that -#' partially match or are similar to \code{find.term}. Returns \code{-1} if no -#' match was found. -#' -#' @note This function does \emph{not} return the position of a matching string \emph{inside} -#' another string, but the element's index of the \code{search.string} vector, where -#' a (partial) match with \code{find.term} was found. Thus, searching for "abc" in -#' a string "this is abc" will not return 9 (the start position of the substring), -#' but 1 (the element index, which is always 1 if \code{search.string} only has one element). 
-#' -#' @details For \code{part.dist.match = 1}, a substring of \code{length(find.term)} is extracted -#' from \code{search.string}, starting at position 0 in \code{search.string} until -#' the end of \code{search.string} is reached. Each substring is matched against -#' \code{find.term}, and results with a maximum distance of \code{maxdist} -#' are considered as "matching". If \code{part.dist.match = 2}, the range -#' of the extracted substring is increased by 2, i.e. the extracted substring -#' is two chars longer and so on. -#' -#' @examples -#' \dontrun{ -#' string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic") -#' str_pos(string, "hel") # partial match -#' str_pos(string, "stem") # partial match -#' str_pos(string, "R") # no match -#' str_pos(string, "saste") # similarity to "System" -#' -#' # finds two indices, because partial matching now -#' # also applies to "Systemic" -#' str_pos(string, -#' "sytsme", -#' part.dist.match = 1) -#' -#' # finds nothing -#' str_pos("We are Sex Pistols!", "postils") -#' # finds partial matching of similarity -#' str_pos("We are Sex Pistols!", "postils", part.dist.match = 1)} -#' -#' @importFrom stringdist stringdist -#' @importFrom utils txtProgressBar setTxtProgressBar -#' @export -str_pos <- function(search.string, - find.term, - maxdist = 2, - part.dist.match = 0, - show.pbar = FALSE) { - # init return value - indices <- c() - - # find element indices from partial matching of string and find term - pos <- as.numeric(grep(find.term, search.string, ignore.case = T)) - if (length(pos) > 0) indices <- c(indices, pos) - - # find element indices from similar strings - pos <- which(stringdist::stringdist(tolower(find.term), tolower(search.string)) <= maxdist) - if (length(pos) > 0) indices <- c(indices, pos) - - # find element indices from partial similar (distance) - # string matching - if (part.dist.match > 0) { - ftlength <- nchar(find.term) - # create progress bar - if (show.pbar) pb <- 
utils::txtProgressBar(min = 0, - max = length(search.string), - style = 3) - - # iterate search string vector - for (ssl in seq_len(length(search.string))) { - # retrieve each element of search string vector - # we do this step by step instead of vectorizing - # due to the substring approach - sst <- search.string[ssl] - - # we extract substrings of same length as find.term - # starting from first char of search.string until end - # and try to find similar matches - steps <- nchar(sst) - ftlength + 1 - for (pi in seq_len(steps)) { - # retrieve substring - sust <- trim(substr(sst, pi, pi + ftlength - 1)) - - # find element indices from similar substrings - pos <- which(stringdist::stringdist(tolower(find.term), tolower(sust)) <= maxdist) - if (length(pos) > 0) indices <- c(indices, ssl) - } - if (part.dist.match > 1) { - - # 2nd loop picks longer substrings, because similarity - # may also be present if length of strings differ - # (e.g. "app" and "apple") - steps <- nchar(sst) - ftlength - if (steps > 1) { - for (pi in 2:steps) { - # retrieve substring - sust <- trim(substr(sst, pi - 1, pi + ftlength)) - - # find element indices from similar substrings - pos <- which(stringdist::stringdist(tolower(find.term), tolower(sust)) <= maxdist) - if (length(pos) > 0) indices <- c(indices, ssl) - } - } - } - # update progress bar - if (show.pbar) utils::setTxtProgressBar(pb, ssl) - } - } - if (show.pbar) close(pb) - - # return result - if (length(indices) > 0) return(sort(unique(indices))) - return(-1) -} +#' @title Find partial matching and close distance elements in strings +#' @name str_pos +#' @description This function finds the element indices of partial matching or similar strings +#' in a character vector. Can be used to find exact or slightly mistyped elements +#' in a string vector. +#' +#' @seealso \code{\link{group_str}} +#' +#' @param search.string Character vector with string elements. 
+#' @param find.term String that should be matched against the elements of \code{search.string}. +#' @param maxdist Maximum distance allowed between two string elements to treat them +#' as similar or equal. Smaller values mean less tolerance in matching. +#' @param part.dist.match Activates similar matching (close distance strings) for parts (substrings) +#' of the \code{search.string}. The following values are accepted: +#' \itemize{ +#' \item 0 for no partial distance matching +#' \item 1 for one-step matching, which means, only substrings of same length as \code{find.term} are extracted from \code{search.string} for matching +#' \item 2 for two-step matching, which means, substrings of same length as \code{find.term} as well as strings with a slightly wider range are extracted from \code{search.string} for matching +#' } +#' Default value is 0. See 'Details' for more information. +#' @param show.pbar Logical; if \code{TRUE}, the progress bar is displayed when computing the distance matrix. +#' Default is \code{FALSE}, hence the bar is hidden. +#' +#' @return A numeric vector with index positions of elements in \code{search.string} that +#' partially match or are similar to \code{find.term}. Returns \code{-1} if no +#' match was found. +#' +#' @note This function does \emph{not} return the position of a matching string \emph{inside} +#' another string, but the element's index of the \code{search.string} vector, where +#' a (partial) match with \code{find.term} was found. Thus, searching for "abc" in +#' a string "this is abc" will not return 9 (the start position of the substring), +#' but 1 (the element index, which is always 1 if \code{search.string} only has one element). +#' +#' @details For \code{part.dist.match = 1}, substrings of \code{nchar(find.term)} characters are extracted +#' from each element of \code{search.string}, starting at the first character and moving on until +#' the end of that element is reached.
Each substring is matched against +#' \code{find.term}, and results with a maximum distance of \code{maxdist} +#' are considered as "matching". If \code{part.dist.match = 2}, the range +#' of the extracted substring is increased by 2, i.e. the extracted substring +#' is two chars longer and so on. +#' +#' @examples +#' \dontrun{ +#' string <- c("Hello", "Helo", "Hole", "Apple", "Ape", "New", "Old", "System", "Systemic") +#' str_pos(string, "hel") # partial match +#' str_pos(string, "stem") # partial match +#' str_pos(string, "R") # no match +#' str_pos(string, "saste") # similarity to "System" +#' +#' # finds two indices, because partial matching now +#' # also applies to "Systemic" +#' str_pos(string, +#' "sytsme", +#' part.dist.match = 1) +#' +#' # finds nothing +#' str_pos("We are Sex Pistols!", "postils") +#' # finds partial matching of similarity +#' str_pos("We are Sex Pistols!", "postils", part.dist.match = 1)} +#' +#' @importFrom stringdist stringdist +#' @importFrom utils txtProgressBar setTxtProgressBar +#' @export +str_pos <- function(search.string, + find.term, + maxdist = 2, + part.dist.match = 0, + show.pbar = FALSE) { + # init return value + indices <- c() + + # find element indices from partial matching of string and find term + pos <- as.numeric(grep(find.term, search.string, ignore.case = T)) + if (length(pos) > 0) indices <- c(indices, pos) + + # find element indices from similar strings + pos <- which(stringdist::stringdist(tolower(find.term), tolower(search.string)) <= maxdist) + if (length(pos) > 0) indices <- c(indices, pos) + + # find element indices from partial similar (distance) + # string matching + if (part.dist.match > 0) { + ftlength <- nchar(find.term) + # create progress bar + if (show.pbar) pb <- utils::txtProgressBar(min = 0, + max = length(search.string), + style = 3) + + # iterate search string vector + for (ssl in seq_len(length(search.string))) { + # retrieve each element of search string vector + # we do this step by step 
instead of vectorizing + # due to the substring approach + sst <- search.string[ssl] + + # we extract substrings of same length as find.term + # starting from first char of search.string until end + # and try to find similar matches + steps <- nchar(sst) - ftlength + 1 + if (steps > 0) { + for (pi in seq_len(steps)) { + # retrieve substring + sust <- trim(substr(sst, pi, pi + ftlength - 1)) + + # find element indices from similar substrings + pos <- which(stringdist::stringdist(tolower(find.term), tolower(sust)) <= maxdist) + if (length(pos) > 0) indices <- c(indices, ssl) + } + } + if (part.dist.match > 1) { + + # 2nd loop picks longer substrings, because similarity + # may also be present if length of strings differ + # (e.g. "app" and "apple") + steps <- nchar(sst) - ftlength + if (steps > 1) { + for (pi in 2:steps) { + # retrieve substring + sust <- trim(substr(sst, pi - 1, pi + ftlength)) + + # find element indices from similar substrings + pos <- which(stringdist::stringdist(tolower(find.term), tolower(sust)) <= maxdist) + if (length(pos) > 0) indices <- c(indices, ssl) + } + } + } + # update progress bar + if (show.pbar) utils::setTxtProgressBar(pb, ssl) + } + } + if (show.pbar) close(pb) + + # return result + if (length(indices) > 0) return(sort(unique(indices))) + return(-1) +} diff --git a/R/to_factor.R b/R/to_factor.R index b7f5a08e..688329be 100644 --- a/R/to_factor.R +++ b/R/to_factor.R @@ -4,15 +4,15 @@ #' @description This function converts a variable into a factor, but preserves #' variable and value label attributes. See 'Examples'. #' -#' @seealso \code{\link{to_value}} to convert a factor into a numeric value and -#' \code{\link{to_label}} to convert a value into a factor with labelled +#' @seealso \code{\link{to_value}} to convert a factor into a numeric vector and +#' \code{\link{to_label}} to convert a vector into a factor with labelled #' factor levels. #' #' @param x A vector or data frame. -#' @param ... 
Optional, unquoted names of variables. Required, if \code{x} is -#' a data frame (and no vector) and only selected variables -#' from \code{x} should be processed. You may also use functions like -#' \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +#' @param ... Optional, unquoted names of variables that should be selected for +#' further processing. Required, if \code{x} is a data frame (and no +#' vector) and only selected variables from \code{x} should be processed. +#' You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. #' The latter must be stated as formula (i.e. beginning with \code{~}). #' See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}. #' @param add.non.labelled Logical, if \code{TRUE}, non-labelled values also @@ -23,10 +23,10 @@ #' will become the reference level. See \code{\link{ref_lvl}} for #' details. #' -#' @return A factor variable, including variable and value labels. If \code{x} +#' @return A factor, including variable and value labels. If \code{x} #' is a data frame, the complete data frame \code{x} will be returned, #' where variables specified in \code{...} are coerced -#' to factor variables (including variable and value labels); +#' to factors (including variable and value labels); #' if \code{...} is not specified, applies to all variables in the #' data frame. #' diff --git a/R/to_label.R b/R/to_label.R index 9756988b..a5436836 100644 --- a/R/to_label.R +++ b/R/to_label.R @@ -1,306 +1,306 @@ -#' @title Convert variable into factor with associated value labels -#' @name to_label -#' -#' @description This function converts (replaces) variable values (also of factors -#' or character vectors) with their associated value labels. Might -#' be helpful for factor variables. 
-#' For instance, if you have a Gender variable with 0/1 value, and associated -#' labels are male/female, this function would convert all 0 to male and -#' all 1 to female and returns the new variable as factor. -#' -#' @param add.non.labelled logical, if \code{TRUE}, values without associated -#' value label will also be converted to labels (as is). See 'Examples'. -#' @param prefix Logical, if \code{TRUE}, the value labels used as factor levels -#' or character values will be prefixed with their associated values. See 'Examples'. -#' @param drop.na Logical, if \code{TRUE}, tagged \code{NA} values with value labels -#' will be converted to regular NA's. Else, tagged \code{NA} values will be replaced -#' with their value labels. See 'Examples' and \code{\link{get_na}}. -#' @param drop.levels Logical, if \code{TRUE}, unused factor levels will be -#' dropped (i.e. \code{\link{droplevels}} will be applied before returning -#' the result). -#' -#' @inheritParams to_factor -#' @inheritParams rec -#' -#' @return A factor variable with the associated value labels as factor levels. If \code{x} -#' is a data frame, the complete data frame \code{x} will be returned, -#' where variables specified in \code{...} are coerced to factors; -#' if \code{...} is not specified, applies to all variables in the -#' data frame. -#' -#' @note Value label attributes (see \code{\link{get_labels}}) -#' will be removed when converting variables to factors. -#' -#' @details See 'Details' in \code{\link{get_na}}. 
-#' -#' @examples -#' data(efc) -#' print(get_labels(efc)['c161sex']) -#' head(efc$c161sex) -#' head(to_label(efc$c161sex)) -#' -#' print(get_labels(efc)['e42dep']) -#' table(efc$e42dep) -#' table(to_label(efc$e42dep)) -#' -#' head(efc$e42dep) -#' head(to_label(efc$e42dep)) -#' -#' # structure of numeric values won't be changed -#' # by this function, it only applies to labelled vectors -#' # (typically categorical or factor variables) -#' str(efc$e17age) -#' str(to_label(efc$e17age)) -#' -#' -#' # factor with non-numeric levels -#' to_label(factor(c("a", "b", "c"))) -#' -#' # factor with non-numeric levels, prefixed -#' x <- factor(c("a", "b", "c")) -#' x <- set_labels(x, labels = c("ape", "bear", "cat")) -#' to_label(x, prefix = TRUE) -#' -#' -#' # create vector -#' x <- c(1, 2, 3, 2, 4, NA) -#' # add less labels than values -#' x <- set_labels(x, -#' labels = c("yes", "maybe", "no"), -#' force.labels = FALSE, -#' force.values = FALSE) -#' # convert to label w/o non-labelled values -#' to_label(x) -#' # convert to label, including non-labelled values -#' to_label(x, add.non.labelled = TRUE) -#' -#' -#' # create labelled integer, with missing flag -#' library(haven) -#' x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1, 2:3), -#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), -#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) -#' # to labelled factor, with missing labels -#' to_label(x, drop.na = FALSE) -#' # to labelled factor, missings removed -#' to_label(x, drop.na = TRUE) -#' # keep missings, and use non-labelled values as well -#' to_label(x, add.non.labelled = TRUE, drop.na = FALSE) -#' -#' -#' # convert labelled character to factor -#' dummy <- c("M", "F", "F", "X") -#' dummy <- set_labels( -#' dummy, -#' labels = c(`M` = "Male", `F` = "Female", `X` = "Refused") -#' ) -#' get_labels(dummy,, "p") -#' to_label(dummy) -#' -#' # drop unused factor levels, but preserve variable label -#' x <- factor(c("a", "b", "c"), levels = 
c("a", "b", "c", "d")) -#' x <- set_labels(x, labels = c("ape", "bear", "cat")) -#' set_label(x) <- "A factor!" -#' x -#' to_label(x, drop.levels = TRUE) -#' -#' # change variable label -#' to_label(x, var.label = "New variable label!", drop.levels = TRUE) -#' -#' -#' # easily coerce specific variables in a data frame to factor -#' # and keep other variables, with their class preserved -#' to_label(efc, e42dep, e16sex, c172code) -#' -#' @export -to_label <- function(x, ..., add.non.labelled = FALSE, prefix = FALSE, var.label = NULL, drop.na = TRUE, drop.levels = FALSE) { - # evaluate arguments, generate data - .dots <- match.call(expand.dots = FALSE)$`...` - .dat <- get_dot_data(x, .dots) - - if (is.data.frame(x)) { - # iterate variables of data frame - for (i in colnames(.dat)) { - x[[i]] <- to_label_helper(.dat[[i]], add.non.labelled, prefix, var.label, drop.na, drop.levels) - } - # coerce to tibble - x <- tibble::as_tibble(x) - } else { - x <- to_label_helper(.dat, add.non.labelled, prefix, var.label, drop.na, drop.levels) - } - - x -} - -#' @importFrom haven na_tag -to_label_helper <- function(x, add.non.labelled, prefix, var.label, drop.na, drop.levels) { - # prefix labels? - if (prefix) - iv <- "p" - else - iv <- 0 - # retrieve variable label - if (is.null(var.label)) - var_lab <- get_label(x) - else - var_lab <- var.label - # keep missings? - if (!drop.na) { - # get NA - current.na <- get_na(x) - # any NA? 
- if (!is.null(current.na)) { - # we have to set all NA labels at once, else NA loses tag - # so we prepare a dummy label-vector, where we copy all different - # NA labels to `x` afterwards - dummy_na <- rep("", times = length(x)) - # iterare NA - for (i in seq_len(length(current.na))) { - dummy_na[haven::na_tag(x) == haven::na_tag(current.na[i])] <- names(current.na)[i] - } - x[haven::is_tagged_na(x)] <- dummy_na[haven::is_tagged_na(x)] - } - } else { - # in case x has tagged NA's we need to be sure to convert - # those into regular NA's, because else saving would not work - x[is.na(x)] <- NA - } - # get value labels - vl <- get_labels(x, attr.only = TRUE, include.values = iv, - include.non.labelled = add.non.labelled, - drop.na = drop.na) - # check if we have any labels, else - # return variable "as is" - if (!is.null(vl)) { - # get associated values for value labels - vnn <- get_labels(x, attr.only = TRUE, include.values = "n", - include.non.labelled = add.non.labelled, - drop.na = drop.na) - - # convert to numeric - vn <- suppressWarnings(as.numeric(names(vnn))) - # where some values non-numeric? if yes, - # use value names as character values - if (anyNA(vn)) vn <- names(vnn) - - # replace values with labels - if (is.factor(x)) { - # more levels than labels? - remain_labels <- levels(x)[!levels(x) %in% vn] - # set new levels - levels(x) <- c(vl, remain_labels) - # remove attributes - x <- remove_all_labels(x) - } else { - for (i in seq_len(length(vl))) x[x == vn[i]] <- vl[i] - # to factor - x <- factor(x, levels = unique(vl)) - } - } - # drop unused levels? 
- if (drop.levels) x <- droplevels(x) - # set back variable labels - if (!is.null(var_lab)) x <- suppressWarnings(set_label(x, label = var_lab)) - # return as factor - return(x) -} - - -#' @title Convert variable into character vector and replace values with associated value labels -#' @name to_character -#' -#' @description This function converts (replaces) variable values (also of factors -#' or character vectors) with their associated value labels and returns -#' them as character vector. This is just a convenient wrapper for -#' \code{as.character(to_label(x))}. -#' -#' @inheritParams to_label -#' -#' @note Value and variable label attributes (see, for instance, \code{\link{get_labels}} -#' or \code{\link{set_labels}}) will be removed when converting variables to factors. -#' -#' @return A character vector with the associated value labels as values. If \code{x} -#' is a data frame, the complete data frame \code{x} will be returned, -#' where variables specified in \code{...} are coerced -#' to character variables; -#' if \code{...} is not specified, applies to all variables in the -#' data frame. -#' -#' @details See 'Details' in \code{\link{get_na}}. 
-#' -#' @examples -#' data(efc) -#' print(get_labels(efc)['c161sex']) -#' head(efc$c161sex) -#' head(to_character(efc$c161sex)) -#' -#' print(get_labels(efc)['e42dep']) -#' table(efc$e42dep) -#' table(to_character(efc$e42dep)) -#' -#' head(efc$e42dep) -#' head(to_character(efc$e42dep)) -#' -#' # numeric values w/o value labels will also be converted into character -#' str(efc$e17age) -#' str(to_character(efc$e17age)) -#' -#' -#' # factor with non-numeric levels, non-prefixed and prefixed -#' x <- factor(c("a", "b", "c")) -#' x <- set_labels(x, labels = c("ape", "bear", "cat")) -#' -#' to_character(x, prefix = FALSE) -#' to_character(x, prefix = TRUE) -#' -#' -#' # create vector -#' x <- c(1, 2, 3, 2, 4, NA) -#' # add less labels than values -#' x <- set_labels(x, -#' labels = c("yes", "maybe", "no"), -#' force.labels = FALSE, -#' force.values = FALSE) -#' # convert to character w/o non-labelled values -#' to_character(x) -#' # convert to character, including non-labelled values -#' to_character(x, add.non.labelled = TRUE) -#' -#' -#' # create labelled integer, with missing flag -#' library(haven) -#' x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1, 2:3), -#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), -#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) -#' # to character, with missing labels -#' to_character(x, drop.na = FALSE) -#' # to character, missings removed -#' to_character(x, drop.na = TRUE) -#' # keep missings, and use non-labelled values as well -#' to_character(x, add.non.labelled = TRUE, drop.na = FALSE) -#' -#' -#' # easily coerce specific variables in a data frame to character -#' # and keep other variables, with their class preserved -#' to_character(efc, e42dep, e16sex, c172code) -#' -#' @export -to_character <- function(x, ..., add.non.labelled = FALSE, prefix = FALSE, var.label = NULL, drop.na = TRUE, drop.levels = FALSE) { - # evaluate arguments, generate data - .dots <- match.call(expand.dots = FALSE)$`...` - 
.dat <- get_dot_data(x, .dots) - - if (is.data.frame(x)) { - - # iterate variables of data frame - for (i in colnames(.dat)) { - x[[i]] <- as.character(to_label_helper(.dat[[i]], add.non.labelled, prefix, var.label, drop.na, drop.levels)) - } - # coerce to tibble - x <- tibble::as_tibble(x) - } else { - x <- as.character(to_label_helper(.dat, add.non.labelled, prefix, var.label, drop.na, drop.levels)) - } - - x -} +#' @title Convert variable into factor with associated value labels +#' @name to_label +#' +#' @description This function converts (replaces) values of a variable (also of factors +#' or character vectors) with their associated value labels. Might +#' be helpful for factor variables. +#' For instance, if you have a Gender variable with 0/1 value, and associated +#' labels are male/female, this function would convert all 0 to male and +#' all 1 to female and returns the new variable as factor. +#' +#' @param add.non.labelled Logical, if \code{TRUE}, values without associated +#' value label will also be converted to labels (as is). See 'Examples'. +#' @param prefix Logical, if \code{TRUE}, the value labels used as factor levels +#' or character values will be prefixed with their associated values. See 'Examples'. +#' @param drop.na Logical, if \code{TRUE}, tagged \code{NA} values with value labels +#' will be converted to regular NA's. Else, tagged \code{NA} values will be replaced +#' with their value labels. See 'Examples' and \code{\link{get_na}}. +#' @param drop.levels Logical, if \code{TRUE}, unused factor levels will be +#' dropped (i.e. \code{\link{droplevels}} will be applied before returning +#' the result). +#' +#' @inheritParams to_factor +#' @inheritParams rec +#' +#' @return A factor with the associated value labels as factor levels. 
If \code{x} +#' is a data frame, the complete data frame \code{x} will be returned, +#' where variables specified in \code{...} are coerced to factors; +#' if \code{...} is not specified, applies to all variables in the +#' data frame. +#' +#' @note Value label attributes (see \code{\link{get_labels}}) +#' will be removed when converting variables to factors. +#' +#' @details See 'Details' in \code{\link{get_na}}. +#' +#' @examples +#' data(efc) +#' print(get_labels(efc)['c161sex']) +#' head(efc$c161sex) +#' head(to_label(efc$c161sex)) +#' +#' print(get_labels(efc)['e42dep']) +#' table(efc$e42dep) +#' table(to_label(efc$e42dep)) +#' +#' head(efc$e42dep) +#' head(to_label(efc$e42dep)) +#' +#' # structure of numeric values won't be changed +#' # by this function, it only applies to labelled vectors +#' # (typically categorical or factor variables) +#' str(efc$e17age) +#' str(to_label(efc$e17age)) +#' +#' +#' # factor with non-numeric levels +#' to_label(factor(c("a", "b", "c"))) +#' +#' # factor with non-numeric levels, prefixed +#' x <- factor(c("a", "b", "c")) +#' x <- set_labels(x, labels = c("ape", "bear", "cat")) +#' to_label(x, prefix = TRUE) +#' +#' +#' # create vector +#' x <- c(1, 2, 3, 2, 4, NA) +#' # add less labels than values +#' x <- set_labels(x, +#' labels = c("yes", "maybe", "no"), +#' force.labels = FALSE, +#' force.values = FALSE) +#' # convert to label w/o non-labelled values +#' to_label(x) +#' # convert to label, including non-labelled values +#' to_label(x, add.non.labelled = TRUE) +#' +#' +#' # create labelled integer, with missing flag +#' library(haven) +#' x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1, 2:3), +#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), +#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) +#' # to labelled factor, with missing labels +#' to_label(x, drop.na = FALSE) +#' # to labelled factor, missings removed +#' to_label(x, drop.na = TRUE) +#' # keep missings, and use non-labelled values 
as well +#' to_label(x, add.non.labelled = TRUE, drop.na = FALSE) +#' +#' +#' # convert labelled character to factor +#' dummy <- c("M", "F", "F", "X") +#' dummy <- set_labels( +#' dummy, +#' labels = c(`M` = "Male", `F` = "Female", `X` = "Refused") +#' ) +#' get_labels(dummy,, "p") +#' to_label(dummy) +#' +#' # drop unused factor levels, but preserve variable label +#' x <- factor(c("a", "b", "c"), levels = c("a", "b", "c", "d")) +#' x <- set_labels(x, labels = c("ape", "bear", "cat")) +#' set_label(x) <- "A factor!" +#' x +#' to_label(x, drop.levels = TRUE) +#' +#' # change variable label +#' to_label(x, var.label = "New variable label!", drop.levels = TRUE) +#' +#' +#' # easily coerce specific variables in a data frame to factor +#' # and keep other variables, with their class preserved +#' to_label(efc, e42dep, e16sex, c172code) +#' +#' @export +to_label <- function(x, ..., add.non.labelled = FALSE, prefix = FALSE, var.label = NULL, drop.na = TRUE, drop.levels = FALSE) { + # evaluate arguments, generate data + .dots <- match.call(expand.dots = FALSE)$`...` + .dat <- get_dot_data(x, .dots) + + if (is.data.frame(x)) { + # iterate variables of data frame + for (i in colnames(.dat)) { + x[[i]] <- to_label_helper(.dat[[i]], add.non.labelled, prefix, var.label, drop.na, drop.levels) + } + # coerce to tibble + x <- tibble::as_tibble(x) + } else { + x <- to_label_helper(.dat, add.non.labelled, prefix, var.label, drop.na, drop.levels) + } + + x +} + +#' @importFrom haven na_tag +to_label_helper <- function(x, add.non.labelled, prefix, var.label, drop.na, drop.levels) { + # prefix labels? + if (prefix) + iv <- "p" + else + iv <- 0 + # retrieve variable label + if (is.null(var.label)) + var_lab <- get_label(x) + else + var_lab <- var.label + # keep missings? + if (!drop.na) { + # get NA + current.na <- get_na(x) + # any NA? 
+ if (!is.null(current.na)) { + # we have to set all NA labels at once, else NA loses tag + # so we prepare a dummy label-vector, where we copy all different + # NA labels to `x` afterwards + dummy_na <- rep("", times = length(x)) + # iterare NA + for (i in seq_len(length(current.na))) { + dummy_na[haven::na_tag(x) == haven::na_tag(current.na[i])] <- names(current.na)[i] + } + x[haven::is_tagged_na(x)] <- dummy_na[haven::is_tagged_na(x)] + } + } else { + # in case x has tagged NA's we need to be sure to convert + # those into regular NA's, because else saving would not work + x[is.na(x)] <- NA + } + # get value labels + vl <- get_labels(x, attr.only = TRUE, include.values = iv, + include.non.labelled = add.non.labelled, + drop.na = drop.na) + # check if we have any labels, else + # return variable "as is" + if (!is.null(vl)) { + # get associated values for value labels + vnn <- get_labels(x, attr.only = TRUE, include.values = "n", + include.non.labelled = add.non.labelled, + drop.na = drop.na) + + # convert to numeric + vn <- suppressWarnings(as.numeric(names(vnn))) + # where some values non-numeric? if yes, + # use value names as character values + if (anyNA(vn)) vn <- names(vnn) + + # replace values with labels + if (is.factor(x)) { + # more levels than labels? + remain_labels <- levels(x)[!levels(x) %in% vn] + # set new levels + levels(x) <- c(vl, remain_labels) + # remove attributes + x <- remove_all_labels(x) + } else { + for (i in seq_len(length(vl))) x[x == vn[i]] <- vl[i] + # to factor + x <- factor(x, levels = unique(vl)) + } + } + # drop unused levels? 
+ if (drop.levels) x <- droplevels(x) + # set back variable labels + if (!is.null(var_lab)) x <- suppressWarnings(set_label(x, label = var_lab)) + # return as factor + return(x) +} + + +#' @title Convert variable into character vector and replace values with associated value labels +#' @name to_character +#' +#' @description This function converts (replaces) variable values (also of factors +#' or character vectors) with their associated value labels and returns +#' them as character vector. This is just a convenient wrapper for +#' \code{as.character(to_label(x))}. +#' +#' @inheritParams to_label +#' +#' @note Value and variable label attributes (see, for instance, \code{\link{get_labels}} +#' or \code{\link{set_labels}}) will be removed when converting variables to factors. +#' +#' @return A character vector with the associated value labels as values. If \code{x} +#' is a data frame, the complete data frame \code{x} will be returned, +#' where variables specified in \code{...} are coerced +#' to character variables; +#' if \code{...} is not specified, applies to all variables in the +#' data frame. +#' +#' @details See 'Details' in \code{\link{get_na}}. 
+#' +#' @examples +#' data(efc) +#' print(get_labels(efc)['c161sex']) +#' head(efc$c161sex) +#' head(to_character(efc$c161sex)) +#' +#' print(get_labels(efc)['e42dep']) +#' table(efc$e42dep) +#' table(to_character(efc$e42dep)) +#' +#' head(efc$e42dep) +#' head(to_character(efc$e42dep)) +#' +#' # numeric values w/o value labels will also be converted into character +#' str(efc$e17age) +#' str(to_character(efc$e17age)) +#' +#' +#' # factor with non-numeric levels, non-prefixed and prefixed +#' x <- factor(c("a", "b", "c")) +#' x <- set_labels(x, labels = c("ape", "bear", "cat")) +#' +#' to_character(x, prefix = FALSE) +#' to_character(x, prefix = TRUE) +#' +#' +#' # create vector +#' x <- c(1, 2, 3, 2, 4, NA) +#' # add less labels than values +#' x <- set_labels(x, +#' labels = c("yes", "maybe", "no"), +#' force.labels = FALSE, +#' force.values = FALSE) +#' # convert to character w/o non-labelled values +#' to_character(x) +#' # convert to character, including non-labelled values +#' to_character(x, add.non.labelled = TRUE) +#' +#' +#' # create labelled integer, with missing flag +#' library(haven) +#' x <- labelled(c(1:3, tagged_na("a", "c", "z"), 4:1, 2:3), +#' c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"), +#' "Refused" = tagged_na("a"), "Not home" = tagged_na("z"))) +#' # to character, with missing labels +#' to_character(x, drop.na = FALSE) +#' # to character, missings removed +#' to_character(x, drop.na = TRUE) +#' # keep missings, and use non-labelled values as well +#' to_character(x, add.non.labelled = TRUE, drop.na = FALSE) +#' +#' +#' # easily coerce specific variables in a data frame to character +#' # and keep other variables, with their class preserved +#' to_character(efc, e42dep, e16sex, c172code) +#' +#' @export +to_character <- function(x, ..., add.non.labelled = FALSE, prefix = FALSE, var.label = NULL, drop.na = TRUE, drop.levels = FALSE) { + # evaluate arguments, generate data + .dots <- match.call(expand.dots = FALSE)$`...` + 
.dat <- get_dot_data(x, .dots) + + if (is.data.frame(x)) { + + # iterate variables of data frame + for (i in colnames(.dat)) { + x[[i]] <- as.character(to_label_helper(.dat[[i]], add.non.labelled, prefix, var.label, drop.na, drop.levels)) + } + # coerce to tibble + x <- tibble::as_tibble(x) + } else { + x <- as.character(to_label_helper(.dat, add.non.labelled, prefix, var.label, drop.na, drop.levels)) + } + + x +} diff --git a/R/to_value.R b/R/to_value.R index 838b37bd..7336a399 100644 --- a/R/to_value.R +++ b/R/to_value.R @@ -1,7 +1,7 @@ #' @title Convert factors to numeric variables #' @name to_value #' -#' @description This function converts (replaces) factor values with the +#' @description This function converts (replaces) factor levels with the #' related factor level index number, thus the factor is converted to #' a numeric variable. #' @@ -16,8 +16,10 @@ #' #' @return A numeric variable with values ranging either from \code{start.at} to #' \code{start.at} + length of factor levels, or to the corresponding -#' factor levels (if these were numeric). Or a data frame with numeric -#' variables, if \code{x} was a data frame. +#' factor levels (if these were numeric). If \code{x} is a data frame, +#' the complete data frame \code{x} will be returned, where variables +#' specified in \code{...} are coerced to numeric; if \code{...} is +#' not specified, applies to all variables in the data frame. #' #' @inheritParams to_factor #' diff --git a/man/add_labels.Rd b/man/add_labels.Rd index f67c566d..dec25a54 100644 --- a/man/add_labels.Rd +++ b/man/add_labels.Rd @@ -15,10 +15,10 @@ remove_labels(x, ..., value) \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. 
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/count_na.Rd b/man/count_na.Rd
index 6b157665..92cbb4ec 100644
--- a/man/count_na.Rd
+++ b/man/count_na.Rd
@@ -9,10 +9,10 @@ count_na(x, ...)
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 }
diff --git a/man/descr.Rd b/man/descr.Rd
index 34629ba2..dd0917b1 100644
--- a/man/descr.Rd
+++ b/man/descr.Rd
@@ -10,10 +10,10 @@ descr(x, ..., max.length = NULL)
 \item{x}{A vector or a data frame. May also be a grouped data frame
 (see 'Note' and 'Examples').}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing.
Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/dicho.Rd b/man/dicho.Rd
index 0a9e27da..8b5b6d05 100644
--- a/man/dicho.Rd
+++ b/man/dicho.Rd
@@ -10,10 +10,10 @@ dicho(x, ..., dich.by = "median", as.num = FALSE, var.label = NULL,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/find_var.Rd b/man/find_var.Rd
index c66bdd7c..0dd95a8a 100644
--- a/man/find_var.Rd
+++ b/man/find_var.Rd
@@ -6,7 +6,7 @@
 \usage{
 find_var(data, pattern, ignore.case = TRUE, search = c("name_label",
 "name_value", "label_value", "name", "label", "value", "all"),
- as.df = FALSE, as.varlab = FALSE)
+ as.df = FALSE, as.varlab = FALSE, fuzzy = FALSE)
 }
 \arguments{
 \item{data}{A data frame.}
@@ -45,6 +45,11 @@ is returned (instead of their column indices).}
 
 \item{as.varlab}{Logical, if \code{TRUE}, not only column indices, but also
 variables labels of matching variables are returned (as data frame).}
+
+\item{fuzzy}{Logical, if \code{TRUE}, "fuzzy matching" (partial and
+close distance matching) will be used to find \code{pattern}
+in \code{data} if no exact match was found. \code{\link{str_pos}}
+is used for fuzzy matching.}
 }
 \value{
 A named vector with column indices of found variables (variable names
diff --git a/man/frq.Rd b/man/frq.Rd
index 92883c22..316abc25 100644
--- a/man/frq.Rd
+++ b/man/frq.Rd
@@ -10,10 +10,10 @@ frq(x, ..., sort.frq = c("none", "asc", "desc"), weight.by = NULL)
 \item{x}{A vector or a data frame. May also be a grouped data frame
 (see 'Note' and 'Examples').}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/group_var.Rd b/man/group_var.Rd
index ca33f359..2f787473 100644
--- a/man/group_var.Rd
+++ b/man/group_var.Rd
@@ -14,10 +14,10 @@ group_labels(x, ..., groupsize = 5, right.interval = FALSE,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/rec.Rd b/man/rec.Rd
index 8124ba82..268197e0 100644
--- a/man/rec.Rd
+++ b/man/rec.Rd
@@ -10,10 +10,10 @@ rec(x, ..., rec, as.num = TRUE, var.label = NULL, val.labels = NULL,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
@@ -84,7 +84,7 @@ Please note following behaviours of the function:
  \item Non-matching values will be set to \code{NA}, unless captured by the \code{"else"}-token.
  \item Tagged NA values (see \code{\link[haven]{tagged_na}}) and their value labels will be preserved when copying NA values to the recoded vector with \code{"else=copy"}.
  \item Variable label attributes (see, for instance, \code{\link{get_label}}) are preserved (unless changed via \code{var.label}-argument), however, value label attributes are removed (except for \code{"rev"}, where present value labels will be automatically reversed as well). Use \code{val.labels}-argument to add labels for recoded values.
- \item If \code{x} is a data frame or list-object, all variables should have the same categories resp. value range (else, see second bullet, \code{NA}s are produced).
+ \item If \code{x} is a data frame, all variables should have the same categories resp. value range (else, see second bullet, \code{NA}s are produced).
 }
 }
 \examples{
diff --git a/man/recode_to.Rd b/man/recode_to.Rd
index 85368d26..c91e5ac7 100644
--- a/man/recode_to.Rd
+++ b/man/recode_to.Rd
@@ -10,10 +10,10 @@ recode_to(x, ..., lowest = 0, highest = -1, append = FALSE,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/ref_lvl.Rd b/man/ref_lvl.Rd
index f6cb021e..fe885e66 100644
--- a/man/ref_lvl.Rd
+++ b/man/ref_lvl.Rd
@@ -9,10 +9,10 @@ ref_lvl(x, ..., lvl = NULL, value)
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/replace_na.Rd b/man/replace_na.Rd
index 1d440204..4499beb4 100644
--- a/man/replace_na.Rd
+++ b/man/replace_na.Rd
@@ -9,10 +9,10 @@ replace_na(x, ..., value, na.label = NULL, tagged.na = NULL)
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/row_sums.Rd b/man/row_sums.Rd
index b86b7f32..201729e2 100644
--- a/man/row_sums.Rd
+++ b/man/row_sums.Rd
@@ -12,10 +12,10 @@ row_means(x, ..., n, var = "rowmeans", append = FALSE)
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
@@ -24,6 +24,10 @@ the calculations.}
 
 \item{var}{Name of new the variable with the row sums or means.}
 
+\item{append}{Logical, if \code{TRUE} and \code{x} is a data frame,
+\code{x} including the new variables as additional columns is returned;
+if \code{FALSE} (the default), only the new variables are returned.}
+
 \item{n}{May either be
 \itemize{
 \item a numeric value that indicates the amount of valid values per row to calculate the row mean;
@@ -33,7 +37,7 @@ If a row's sum of valid values is less than \code{n}, \code{NA} will be returned
 }
 \value{
 For \code{row_sums()}, a tibble with one variable: the row sums from
- \code{x}; for \code{row_means()}, a tibble with one variable: row
+ \code{x}; for \code{row_means()}, a tibble with one variable: the row
  means from \code{x}.
 }
 \description{
diff --git a/man/set_labels.Rd b/man/set_labels.Rd
index 4f7eb31d..c290e76a 100644
--- a/man/set_labels.Rd
+++ b/man/set_labels.Rd
@@ -10,10 +10,10 @@ set_labels(x, ..., labels, force.labels = FALSE, force.values = TRUE,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/set_na.Rd b/man/set_na.Rd
index d32deb2d..784e654f 100644
--- a/man/set_na.Rd
+++ b/man/set_na.Rd
@@ -9,10 +9,10 @@ set_na(x, ..., na, drop.levels = TRUE, as.tag = FALSE, value)
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/split_var.Rd b/man/split_var.Rd
index 400fc344..89fd9705 100644
--- a/man/split_var.Rd
+++ b/man/split_var.Rd
@@ -10,10 +10,10 @@ split_var(x, ..., groupcount, as.num = FALSE, val.labels = NULL,
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
 See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.}
 
diff --git a/man/std.Rd b/man/std.Rd
index 9bb8ce9a..4d33c85b 100644
--- a/man/std.Rd
+++ b/man/std.Rd
@@ -12,10 +12,10 @@ center(x, ..., include.fac = TRUE, append = FALSE, suffix = "_c")
 \arguments{
 \item{x}{A vector or data frame.}
 
-\item{...}{Optional, unquoted names of variables. Required, if \code{x} is
-a data frame (and no vector) and only selected variables
-from \code{x} should be processed. You may also use functions like
-\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
+\item{...}{Optional, unquoted names of variables that should be selected for
+further processing. Required, if \code{x} is a data frame (and no
+vector) and only selected variables from \code{x} should be processed.
+You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}.
 The latter must be stated as formula (i.e. beginning with \code{~}).
See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} diff --git a/man/to_character.Rd b/man/to_character.Rd index 5e7400f1..b37c33ff 100644 --- a/man/to_character.Rd +++ b/man/to_character.Rd @@ -10,14 +10,14 @@ to_character(x, ..., add.non.labelled = FALSE, prefix = FALSE, \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} -\item{add.non.labelled}{logical, if \code{TRUE}, values without associated +\item{add.non.labelled}{Logical, if \code{TRUE}, values without associated value label will also be converted to labels (as is). See 'Examples'.} \item{prefix}{Logical, if \code{TRUE}, the value labels used as factor levels diff --git a/man/to_dummy.Rd b/man/to_dummy.Rd index b0ef9541..d5676cd4 100644 --- a/man/to_dummy.Rd +++ b/man/to_dummy.Rd @@ -9,10 +9,10 @@ to_dummy(x, ..., var.name = "name", suffix = c("numeric", "label")) \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. 
Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} diff --git a/man/to_factor.Rd b/man/to_factor.Rd index 78626232..38c8cc70 100644 --- a/man/to_factor.Rd +++ b/man/to_factor.Rd @@ -9,10 +9,10 @@ to_factor(x, ..., add.non.labelled = FALSE, ref.lvl = NULL) \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} @@ -26,10 +26,10 @@ will become the reference level. See \code{\link{ref_lvl}} for details.} } \value{ -A factor variable, including variable and value labels. If \code{x} +A factor, including variable and value labels. If \code{x} is a data frame, the complete data frame \code{x} will be returned, where variables specified in \code{...} are coerced - to factor variables (including variable and value labels); + to factors (including variable and value labels); if \code{...} is not specified, applies to all variables in the data frame. 
} @@ -98,7 +98,7 @@ to_factor(efc, ~contains("cop"), c161sex:c175empl) } \seealso{ -\code{\link{to_value}} to convert a factor into a numeric value and - \code{\link{to_label}} to convert a value into a factor with labelled +\code{\link{to_value}} to convert a factor into a numeric vector and + \code{\link{to_label}} to convert a vector into a factor with labelled factor levels. } diff --git a/man/to_label.Rd b/man/to_label.Rd index 5a95bfb5..5c6ee200 100644 --- a/man/to_label.Rd +++ b/man/to_label.Rd @@ -10,14 +10,14 @@ to_label(x, ..., add.non.labelled = FALSE, prefix = FALSE, \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} -\item{add.non.labelled}{logical, if \code{TRUE}, values without associated +\item{add.non.labelled}{Logical, if \code{TRUE}, values without associated value label will also be converted to labels (as is). See 'Examples'.} \item{prefix}{Logical, if \code{TRUE}, the value labels used as factor levels @@ -37,14 +37,14 @@ dropped (i.e. \code{\link{droplevels}} will be applied before returning the result).} } \value{ -A factor variable with the associated value labels as factor levels. If \code{x} +A factor with the associated value labels as factor levels. 
If \code{x} is a data frame, the complete data frame \code{x} will be returned, where variables specified in \code{...} are coerced to factors; if \code{...} is not specified, applies to all variables in the data frame. } \description{ -This function converts (replaces) variable values (also of factors +This function converts (replaces) values of a variable (also of factors or character vectors) with their associated value labels. Might be helpful for factor variables. For instance, if you have a Gender variable with 0/1 value, and associated diff --git a/man/to_value.Rd b/man/to_value.Rd index 6fde6493..758bd130 100644 --- a/man/to_value.Rd +++ b/man/to_value.Rd @@ -9,10 +9,10 @@ to_value(x, ..., start.at = NULL, keep.labels = TRUE) \arguments{ \item{x}{A vector or data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} @@ -29,11 +29,13 @@ if present. See 'Examples' and \code{\link{set_labels}} for more details.} \value{ A numeric variable with values ranging either from \code{start.at} to \code{start.at} + length of factor levels, or to the corresponding - factor levels (if these were numeric). Or a data frame with numeric - variables, if \code{x} was a data frame. + factor levels (if these were numeric). 
If \code{x} is a data frame, + the complete data frame \code{x} will be returned, where variables + specified in \code{...} are coerced to numeric; if \code{...} is + not specified, applies to all variables in the data frame. } \description{ -This function converts (replaces) factor values with the +This function converts (replaces) factor levels with the related factor level index number, thus the factor is converted to a numeric variable. } diff --git a/man/zap_inf.Rd b/man/zap_inf.Rd index b6d06527..86e8af57 100644 --- a/man/zap_inf.Rd +++ b/man/zap_inf.Rd @@ -9,10 +9,10 @@ zap_inf(x, ...) \arguments{ \item{x}{A vector or a data frame.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} } diff --git a/man/zap_labels.Rd b/man/zap_labels.Rd index 69e0b05b..d2cbdbdd 100644 --- a/man/zap_labels.Rd +++ b/man/zap_labels.Rd @@ -19,10 +19,10 @@ zap_unlabelled(x, ...) \item{x}{(partially) \code{\link[haven]{labelled}} vector or a data frame with such vectors.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. 
Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} diff --git a/man/zap_na_tags.Rd b/man/zap_na_tags.Rd index 05201926..2ca954ac 100644 --- a/man/zap_na_tags.Rd +++ b/man/zap_na_tags.Rd @@ -10,10 +10,10 @@ zap_na_tags(x, ...) \item{x}{A \code{\link[haven]{labelled}} vector with \code{tagged_na} values, or a data frame with such vectors.} -\item{...}{Optional, unquoted names of variables. Required, if \code{x} is -a data frame (and no vector) and only selected variables -from \code{x} should be processed. You may also use functions like -\code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. +\item{...}{Optional, unquoted names of variables that should be selected for +further processing. Required, if \code{x} is a data frame (and no +vector) and only selected variables from \code{x} should be processed. +You may also use functions like \code{:} or dplyr's \code{\link[dplyr]{select_helpers}}. The latter must be stated as formula (i.e. beginning with \code{~}). See 'Examples' or \href{../doc/design_philosophy.html}{package-vignette}.} }