Subsetting loses variable labels #392

Closed
Crismoc opened this issue Jul 27, 2018 · 15 comments
Labels
feature a feature request or enhancement

Comments

Crismoc commented Jul 27, 2018

When I import SPSS data with read_sav() variable labels are correctly imported.

data<-haven::read_sav("data.sav")
str(data$sexo)
> 'labelled' num [1:738] 1 0 0 1 1 0 1 0 0 0 ...
 - attr(*, "label")= chr "Sexo (0: Hombre, 1: Mujer)"
 - attr(*, "format.spss")= chr "F8.2"
 - attr(*, "display_width")= int 10
 - attr(*, "labels")= Named num [1:2] 0 1
  ..- attr(*, "names")= chr [1:2] "Mujer" "Hombre"

Then I create multiple subsamples of it and write them to new SPSS files with write_sav().

haven::write_sav(data[sample(nrow(data),350),], "data_subsample.sav")

When opening the new files in SPSS (or reading them back with read_sav()), the variable labels are lost. Is there a way of fixing this?

Thanks!

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.5.0 magrittr_1.5   tools_3.5.0    pillar_1.2.3   tibble_1.4.2  
[6] yaml_2.1.19    Rcpp_0.12.17   forcats_0.3.0  rlang_0.2.1  
@hadley added the "reprex" label (needs a minimal reproducible example) Aug 28, 2018
huftis commented Aug 29, 2018

Ah, I see. The bug arises because subsetting the variables drops their attributes. That happens because variables with only a label but no labels don’t have a custom class (e.g. haven_labelled) for which a subsetting method is defined.

Note that this only applies to variables that have a label attribute but no labels attribute. (So, for instance, the fare variable loses its label, but the Residence variable keeps it.)

A possible solution would be for the importer to store the variables as haven_labelled objects. The class haven_labelled would then have to be redefined to have either 1) labels, 2) a label, or 3) both labels and a label. The class name would fit this definition, but some functions using haven_labelled objects would probably have to be redefined to accept labels-less haven_labelled objects.

Here’s a reprex:

library(haven)
d <- read_sav("https://www.sheffield.ac.uk/polopoly_fs/1.547009!/file/Titanic_2.sav")

class(d$Residence) # OK
#> [1] "labelled"
class(d$fare) # Missing class
#> [1] "numeric"

# Subsetting ‘fare’ loses the attribute
attr(d$fare, "label")
#> [1] "Cost of ticket"
attr(d$fare[1:3], "label")
#> NULL

# But subsetting ‘Residence’ works fine
attr(d$Residence, "label")
#> [1] "Country of residence"
attr(d$Residence[1:3], "label")
#> [1] "Country of residence"
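
The same behaviour can be reproduced without haven or the .sav file. The following is a minimal sketch in base R, assuming made-up fare values: a plain numeric vector loses its label on subsetting, while a vector with a class that defines its own `[` method keeps it. The class name keeps_label is hypothetical and only stands in for the role that the labelled class plays above; it is not haven's implementation.

```r
# A plain numeric vector that carries only a "label" attribute
# (values are made up for illustration).
fare <- c(7.25, 71.28, 7.92, 53.10)
attr(fare, "label") <- "Cost of ticket"

# Base `[` drops the attribute.
attr(fare[1:3], "label")
#> NULL

# Hypothetical class with a `[` method that re-attaches the label.
`[.keeps_label` <- function(x, i) {
  out <- unclass(x)[i]                    # plain subsetting, attributes dropped
  attr(out, "label") <- attr(x, "label")  # put the label back
  class(out) <- class(x)                  # keep the class for further subsetting
  out
}

fare_classed <- structure(fare, class = "keeps_label")
attr(fare_classed[1:3], "label")
#> [1] "Cost of ticket"
```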

@hadley added the "feature" label (a feature request or enhancement) and removed the "reprex" label (needs a minimal reproducible example) Jan 24, 2019
@hadley changed the title from "write_sav() loses variable labels" to "Subsetting loses variable labels" Jan 24, 2019
hadley commented Jan 24, 2019

I think the proposal to always use haven_labelled would create more problems than it solves, because more people would be forced to confront an exotic variable type that doesn't actually bear on their analysis.

Instead I think you're better off creating a custom function to restore labels, something like this (untested, but should be close):

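# Copy variable labels from `orig` onto any column of `df` that has lost its label.
restore_labels <- function(df, orig) {
  for (var in names(orig)) {
    # Column is not present in the subset
    if (is.null(df[[var]]))
      next

    # Label survived subsetting; leave it alone
    if (!is.null(attr(df[[var]], "label")))
      next

    # Original column has no label to restore
    if (is.null(attr(orig[[var]], "label")))
      next

    attr(df[[var]], "label") <- attr(orig[[var]], "label")
  }

  df
}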
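
For anyone who wants to try this without a .sav file, here is a small usage sketch on a base data.frame. The function is repeated in compact form, with the loop iterating over names(orig), so that the block runs on its own. The column names and values are made up, and the result has not been checked against tibbles returned by read_sav().

```r
# Compact copy of the function above (loop iterates over names(orig)).
restore_labels <- function(df, orig) {
  for (var in names(orig)) {
    if (is.null(df[[var]])) next
    if (!is.null(attr(df[[var]], "label"))) next
    if (is.null(attr(orig[[var]], "label"))) next
    attr(df[[var]], "label") <- attr(orig[[var]], "label")
  }
  df
}

# Made-up data: one labelled column, one unlabelled column.
orig <- data.frame(fare = c(7.25, 71.28, 7.92, 53.10),
                   age  = c(22, 38, 26, 35))
attr(orig$fare, "label") <- "Cost of ticket"

# Row subsetting drops the label from the plain numeric column.
sub <- orig[c(1, 3), ]
attr(sub$fare, "label")
#> NULL

# Restore it from the original data frame.
sub <- restore_labels(sub, orig)
attr(sub$fare, "label")
#> [1] "Cost of ticket"
```

Columns that never had a label (age here) are left untouched, so the function can be called on any subset before passing it to write_sav().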

@hadley hadley closed this as completed Jan 24, 2019
@larmarange

The labelled package provides a copy_labels() function that does exactly this: http://larmarange.github.io/labelled/reference/copy_labels.html
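
A short sketch of how that could look, assuming the labelled package is installed and that copy_labels() takes the source first and the target second, i.e. copy_labels(from, to). The data are made up; please check the linked reference page for the current signature before relying on this.

```r
library(labelled)

# Made-up data: one labelled column, one unlabelled column.
orig <- data.frame(fare = c(7.25, 71.28, 7.92, 53.10),
                   age  = c(22, 38, 26, 35))
var_label(orig$fare) <- "Cost of ticket"

# Row subsetting drops the variable label from `fare`.
sub <- orig[c(1, 3), ]

# Copy labels from the original data frame onto the subset.
sub <- copy_labels(orig, sub)
var_label(sub$fare)
```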

lock bot commented Jul 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jul 23, 2019