additional argument for as_factor #81

Closed
larmarange opened this Issue Jun 23, 2015 · 16 comments

larmarange commented Jun 23, 2015

When importing data from SPSS (or other software), some values may have no label. This can be due to an error in the data documentation or, for some numerical variables, because only specific values were given a label. A typical example: a score from 0 to 10 where only 0 and 10 have a label.

Currently as_factor.labelled assigns NA to values that don't have a corresponding label. It would be great to have an option to keep these values (and in that case to create a level whose label would be the original numeric code).

hadley commented Jun 23, 2015

Could you provide a small reproducible example? (using labelled())

larmarange commented Jun 23, 2015

test <- labelled(c(1,2,3,4,5,1,2,5,4), c(Bad = 1, Good = 5))
as_factor(test)

will return

[1] Bad  <NA> <NA> <NA> Good Bad  <NA> Good <NA>
Levels: Bad Good

while I would like to obtain

[1] Bad  2    3    4    Good Bad  2    Good 4
Levels: Bad 2 3 4 Good
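
For illustration, the requested behaviour can be sketched in base R without going through haven. This is only a rough sketch under stated assumptions: label_or_value() is a hypothetical helper, not a function of haven or labelled, and it takes the values and the named label vector as two plain arguments instead of a labelled object.

```r
# Hypothetical helper (not part of haven or labelled): build a factor that
# uses the label when one exists and falls back to the numeric code otherwise.
label_or_value <- function(values, labels) {
  # `labels` is a named vector: names are the labels, elements are the codes
  codes <- sort(unique(c(values, labels)))  # every code, labelled or not
  idx <- match(codes, labels)               # position of each code in `labels`
  lev <- ifelse(is.na(idx), as.character(codes), names(labels)[idx])
  factor(values, levels = codes, labels = lev)
}

x <- c(1, 2, 3, 4, 5, 1, 2, 5, 4)
f <- label_or_value(x, c(Bad = 1, Good = 5))
levels(f)
```

Levels are ordered by the underlying code, so unlabelled values sit between Bad and Good instead of being turned into NA.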

hadley commented Jun 23, 2015

Hmmm, could imagine this as as_factor(test, "either") or as_factor(test, drop_na = FALSE). What do you think?

larmarange commented Jun 23, 2015

From my point of view, the name drop_na is not appropriate, as value labels and missing values are two different concepts from an SPSS point of view. A value with no label is not necessarily a missing value.

larmarange commented Jun 23, 2015

Something like missing_labels_to_NA would probably be more appropriate.

Furthermore, the default value should be FALSE as by default the function should not change the data itself, only the type of data. Transforming values with missing labels should be a deliberate choice of the user (in particular as most users don't check that all values have a corresponding label).

hadley commented Jun 24, 2015

Well, if you're converting a labelled integer to a factor, you are fundamentally changing the type. I don't see how to get around that.

larmarange commented Jun 24, 2015

Conceptually, there is a difference between changing how data is stored/represented (as a factor, a character, a labelled integer...) and changing the information itself (i.e. converting several different values into a single NA value).

larmarange commented Jul 12, 2015

Hi,

I drafted a labelled package to provide a set of functions to manipulate variable labels, value labels and missing values, based on the labelled class introduced by haven.

cf. https://github.com/larmarange/labelled

An introduction to that package is available at https://github.com/larmarange/labelled/blob/master/vignettes/intro_labelled.Rmd

This package also provides a labelled function and basic methods for the labelled class (so that haven is not required for manipulating labelled vectors).

In order to make all function and argument names consistent, the word "missing" has been used to refer to defined missing values and "na" only to refer to the internal R NA value.

Therefore, the drop_na argument in as_factor has been renamed to missing_to_na. The as_factor method implemented in labelled also provides additional options not available in haven.

But this creates inconsistencies between the two packages: a call to as_factor will not return the same thing depending on which package is used.

There could be several options to solve such inconsistencies and I would like to know your opinion on that subject:

  1. rename the as_factor method in the labelled package to something else;
  2. remove the as_factor method from haven, as the core purpose of haven is to import data, not to manipulate labelled vectors once imported;
  3. rename the drop_na argument in haven to be consistent with labelled. However, the question of any additional arguments offered by labelled but not implemented in haven would remain open;
  4. another solution I have not thought of.

Best regards

larmarange commented Jul 12, 2015

An HTML version of the Introduction to labelled: http://joseph.larmarange.net/intro_labelled.html

larmarange commented Jul 14, 2015

Note: this issue #82 is fixed in labelled

larmarange commented May 30, 2016

To avoid inconsistent behaviour between the haven and labelled packages, the function as_factor has been renamed to to_factor in the labelled package.

hadley added the semantics label May 30, 2016

hadley commented May 31, 2016

Ok - I now understand labelled values much better, and I think that this behaviour (use label if available, otherwise use value) should probably be the default. I'm not sure what to call this behaviour, so I'll probably just go with levels = "default".

hadley closed this in 9efd946 May 31, 2016

hadley commented May 31, 2016

Please give this new behaviour a spin - I'm confident it should be a much better mapping to the way that people usually use labelled values (now that I have a much better mental model of how they work)

larmarange commented May 31, 2016

My feeling is that it would be better to have a clear distinction between:

  1. What are the levels you want to get?
  2. What are the values that should be converted into NA?

In the labelled package (function to_factor), this is achieved by having a separate argument for each purpose.

to_factor(x, levels = c("labels", "values", "prefixed"),
  ordered = FALSE, missing_to_na = FALSE, nolabel_to_na = FALSE,
  sort_levels = c("auto", "none", "labels", "values"), decreasing = FALSE,
  ...)

levels determines if the factor levels should be the labels, the values or the labels prefixed by the values (in all cases, if a value doesn't have a label it is used as a label).

nolabel_to_na determines if values with no label should be converted into NA.
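
To make the separation between the two questions concrete, here is a rough base R sketch. It does not call labelled::to_factor(): sketch_to_factor() is a hypothetical stand-in that covers only levels and nolabel_to_na, and the "[code] label" format used for prefixed levels is an assumption made for the example.

```r
# Hypothetical stand-in for illustration only; not labelled::to_factor().
sketch_to_factor <- function(values, labels,
                             levels = c("labels", "values", "prefixed"),
                             nolabel_to_na = FALSE) {
  levels <- match.arg(levels)
  # Question 2: which values become NA? Only those without a label, on request.
  if (nolabel_to_na) values[!values %in% labels] <- NA
  codes <- sort(unique(c(values, labels)))
  idx <- match(codes, labels)
  # A value without a label is used as its own label.
  lab <- ifelse(is.na(idx), as.character(codes), names(labels)[idx])
  # Question 1: what should the levels look like?
  lev <- switch(levels,
    labels   = lab,
    values   = as.character(codes),
    prefixed = paste0("[", codes, "] ", lab)  # assumed prefix format
  )
  factor(values, levels = codes, labels = lev)
}

x <- c(1, 2, 3, 4, 5, 1, 2, 5, 4)
labs <- c(Bad = 1, Good = 5)
levels(sketch_to_factor(x, labs, levels = "prefixed"))
table(sketch_to_factor(x, labs, nolabel_to_na = TRUE), useNA = "ifany")
```

The first call only changes how levels are displayed; the second is the only one that discards information, and it does so because the user asked for it.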

By the way, did the is_na attribute disappear?

hadley commented May 31, 2016

That seems like too many options to me.

Yes, is_na has gone away - something better will come back as part of #170

larmarange commented May 31, 2016

Following #170, I will remove support for missing values in the labelled package and focus only on value labels.

In fact, in the labelled package, given that nolabel_to_na and sort_levels are available as standalone functions, the only arguments specific to to_factor are levels and ordered.

lock bot locked and limited conversation to collaborators Jun 27, 2018
