as_factor drops unused labels when levels = "default". #172

Closed
itsdalmo opened this Issue Jun 6, 2016 · 5 comments

itsdalmo commented Jun 6, 2016

As the title states, as_factor.labelled drops unused labels when levels = "default" (or "both").

Example:

  s1 <- labelled(rep(1, 3), c("A" = 1, "B" = 2, "C" = 3))
  exp <- factor(rep("A", 3), levels = c("A", "B", "C"))
  str(as_factor(s1, levels = "default"))
  # Factor w/ 1 level "A": 1 1 1

Maybe this is intended, but I'd like to propose that the default should include all labels, so that labelled variables can be converted to factors safely (and easily) without information loss. Users can instead explicitly call droplevels() after converting to produce the result above.
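
To illustrate the proposal, here is a minimal sketch in base R only. No haven function is called; `full` is an assumed stand-in for what as_factor() would return if it kept every label, and `trimmed` is just an illustrative name.

```r
# Stand-in for the proposed result: every label is kept as a level,
# even though only "A" occurs in the data.
full <- factor(rep("A", 3), levels = c("A", "B", "C"))
levels(full)
#> [1] "A" "B" "C"

# Dropping unused levels becomes an explicit, opt-in step.
trimmed <- droplevels(full)
levels(trimmed)
#> [1] "A"
```

Going from `full` to `trimmed` is a single call, while going the other way requires knowing the original labels, which is the argument for keeping them by default.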

-Kristian

itsdalmo commented Jun 7, 2016

(@hadley: Made two posts earlier, merged them into this instead.)

Original issue

labelled contains all the labels and values, but none of the levels options for as_factor can be used to convert to factor in a reliable way. Since the assumption is that users will want to "coerce to a standard R class soon after importing", I think it's important that they don't lose any information along the way:

labs <- setNames(1:2, c("A", "B"))
labelled <- labelled(c(1, 3), labels = labs)

as_factor(labelled, levels = "default") # Drops B from levels
as_factor(labelled, levels = "labels")  # 3 becomes NA

Realistic example

I think it's fairly common for survey data to have implicit missing values on one or more items. In the example below, users would lose the implicit missing values when converting to factor.

set.seed(84)
vals <- c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly disagree", "Don't know")
labelled <- labelled(sample(1:5, size = 100, replace = TRUE), labels = setNames(1:6, vals))

as_factor(labelled, levels = "default") # Drops "Don't know" (value 6 is never observed)

Using levels = "labels" is fine if you can assume that all variables have correct labels. If not, users will have to:

  • Figure out (i.e. manually inspect) whether any labels are missing for valid responses.
  • Add missing labels to the labelled vector before converting to factor (not supported out of the box).
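
The two manual steps above could look roughly like this in base R. This is only a sketch: `unlabelled_values` is a hypothetical helper, not a haven function, and it works on a plain vector plus a named labels vector rather than on a labelled object.

```r
# Hypothetical helper (not part of haven): values in x that have no label.
unlabelled_values <- function(x, labels) {
  sort(setdiff(unique(x[!is.na(x)]), unname(labels)))
}

labs <- setNames(1:2, c("A", "B"))
x <- c(1, 3)

# Step 1: find values that are missing a label.
missing_vals <- unlabelled_values(x, labs)
missing_vals
#> [1] 3

# Step 2: add a label for each of them, using the value itself as the label.
labs_full <- c(labs, setNames(missing_vals, as.character(missing_vals)))
names(labs_full)
#> [1] "A" "B" "3"
```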

Fix

It's just a matter of always keeping all the labels when converting to factor. In response to what you wrote in the pull request, I'd also like to suggest that as_factor prefers the order in which the labels are specified if typeof(x) == "character", instead of the alphabetically sorted values as in this example:

set.seed(84)
vals <- c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly disagree", "Don't know")
labelled <- labelled(sample(vals, size = 100, replace = TRUE), labels = setNames(vals, vals))

as_factor(labelled, levels = "default")
#> ...
#> Levels: Agree Disagree Don't know Neutral Strongly Agree Strongly disagree

In the above, I think it makes more sense to just keep the order of the labels for values that have labels, and just add any values (sorted) without labels to the end of the factor levels.

Code

  # Inside as_factor.labelled
  levels <- match.arg(levels)
  labels <- attr(x, "labels")

  if (levels == "default" || levels == "both") {
    if (levels == "both") {
      names(labels) <- paste0("[", labels, "] ", names(labels))
    }

    values <- sort(unique(x))
    levs <- replace_with(values, unname(labels), names(labels))

    # Retain all labels (and retain label order in labelled character vectors) (#172)
    if (typeof(x) == "character") {
      levs <- unique(c(names(labels), levs))
    } else {
      levs <- c(setNames(values, levs), labels)
      levs <- unique(names(sort(levs)))
    }

    x <- replace_with(x, unname(labels), names(labels))

    factor(x, levels = levs, ordered = FALSE)
  } else {
    # Hidden
  }

This way all the information is preserved when levels = "default", and users can safely turn labelled vectors into factors before cleaning a dataset. I also think that as long as we are sorting numeric values, we can sort the values with missing labels in relation to the existing labels. E.g.:

  vals <- c("Agree", "Neutral", "Disagree", "Don't know")
  labelled <- labelled(c(1, 5), labels = setNames(1:4, vals))
  as_factor(labelled)
  #> [1] Agree 5    
  #> Levels: Agree Neutral Disagree Don't know 5
  vals <- c("Agree", "Neutral", "Disagree", "Don't know")
  labelled <- labelled(c(1, 4), labels = setNames(c(1:3, 5), vals))
  as_factor(labelled)
  #> [1] Agree 4
  #> Levels: Agree Neutral Disagree 4 Don't know 
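
The ordering in the two examples above can be reproduced with a small base R sketch for the numeric case. `as_factor_sketch` is a hypothetical name, and this is not haven's implementation (it does not use replace_with or the labelled class); it only shows labelled and unlabelled values being sorted together by value.

```r
# Hypothetical sketch, base R only, numeric values only.
as_factor_sketch <- function(x, labels) {
  # Observed values that have no label.
  extra <- setdiff(sort(unique(x[!is.na(x)])), unname(labels))
  # Unlabelled values use the value itself as the level name; sorting the
  # combined vector by value interleaves them with the existing labels.
  lookup <- sort(c(labels, setNames(extra, as.character(extra))))
  factor(names(lookup)[match(x, lookup)], levels = names(lookup))
}

vals <- c("Agree", "Neutral", "Disagree", "Don't know")

f1 <- as_factor_sketch(c(1, 5), setNames(1:4, vals))
levels(f1)
#> [1] "Agree"      "Neutral"    "Disagree"   "Don't know" "5"

f2 <- as_factor_sketch(c(1, 4), setNames(c(1:3, 5), vals))
levels(f2)
#> [1] "Agree"      "Neutral"    "Disagree"   "4"          "Don't know"
```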

hadley commented Jun 8, 2016

Can you confirm that order of the labels is actually something that the user can control in SPSS/Stata/SAS? i.e. if you have a labelled character vector c("M", "F") with labels "male" and "female", how do you describe that "male" should come first?

itsdalmo commented Jun 8, 2016

@hadley:
You are right, the order of labels is not something the user controls in SPSS (not sure about Stata/SAS). I'm guessing the order of labels after importing would be alphabetical to begin with, so it would only be helpful after using the labelled() constructor (where order can be specified).

larmarange commented Jun 9, 2016

In some way, this discussion highlights a more general topic regarding labelled vectors. As far as I understand it, we can distinguish two different approaches/philosophies.

The first one, adopted by haven, is to consider that the labelled class is just an intermediate format for data import from SAS/Stata/SPSS. Therefore, there is no need to provide too many functions for manipulating labelled vectors, the main purpose being to convert to more "standard" R formats. In such an approach, functions need to remain simple, without too many options.

A second approach is to consider that labelled vectors are also useful for data manipulation, data cleaning and data recoding, in particular for scientists used to working with survey data, in a study team where other scientists are working with other software, and/or when the purpose is to re-export to SAS/Stata/SPSS. In such a situation, it is appropriate to have several functions to manipulate labelled vectors. However, this is probably not the purpose of the haven package. This is why the labelled package has been developed.

In some way, these two approaches are not antagonistic, but complementary.

@itsdalmo if you need to control the order of the factor that is produced, in particular to keep the order of the labels, you could have a look at the function to_factor in the labelled package. Being able to control the order of labels was also one of my needs. In the labelled package, you have several options to define how to sort them. However, this sort of function is probably relevant only for the second approach, not the first one.

itsdalmo commented Jun 9, 2016

Thanks for the tip @larmarange!

However, I don't think my needs warrant another import/dependency as I really only need to be able to convert labelled vectors to factors in a safe and reliable manner. And I think that levels = "default" not dropping unused labels would do just that.

@hadley was correct about my second point (at least for SPSS); there is no way to create an equivalent to the code below in SPSS. I.e., SPSS sorts labels according to value and we'd end up with Female as the first label when reading the vector from .sav, or when opening it in SPSS.

labelled(c("M", "F"), labels = c("Male" = "M", "Female" = "F"))

But I think the original issue is valid. If we are meant to convert away from labelled soon after importing, we need to feel safe doing so without manually checking for missing labels. After removing the if statement in my pull request, as_factor keeps all existing labels and values that are missing labels. I still think it's a good idea to do the sorting on the existing and the missing labels together.

Since the updated PR also keeps the sorting for character vectors, this means that we can't call as_factor(labelled(...)) on a character vector in R and expect the levels to be in the order in which the labels were specified, but users should probably just be creating the factor directly anyhow.

I think this makes as_factor safer to use/better overall since you don't have to choose between:

  • A: Losing labels with no observations. levels = "default" (or "both")
  • B: Losing values with no label. levels = "labels" (or "values")

With the PR, you can use levels = "default" (or "both") to get both labels and values with missing labels safely converted to factor and then call droplevels() to get the same result as in the current dev version.

@itsdalmo itsdalmo referenced this issue Jun 9, 2016

Closed

WIP: Basic API for tagged missing values #175

3 of 5 tasks complete

hadley added a commit that referenced this issue Jun 9, 2016

Replace tagged NAs in as_factor.
Now preserves all labels in factor levels (labels not in data are added to the end), so is part of #172

hadley added a commit that referenced this issue Jun 9, 2016

Fix for #172 (labels) and #177 (variable label). (#179)
* Add test for #172. Labels for missing values should be preserved.

* Importing setNames from stats

* Rebased fix for #172 and #177.
as_factor now preserves variable label. (#177)

as_factor includes and sorts both existing and missing labels. (#172)

* Use stats::setNames instead of importing.

* Added bullets explaining fixes for #172 and #177 to news.

@hadley hadley closed this Jun 9, 2016

@lock lock bot locked and limited conversation to collaborators Jun 27, 2018
