Data not correctly read in with haven::read_sav when special characters present. Character encoding problem? #560

Closed
deschen1 opened this issue Dec 16, 2020 · 6 comments

@deschen1

This is an exact copy of a question I posted on Stack Overflow. However, I think it is an issue in read_sav itself, so I'm hoping it can be fixed here (if it really is a problem in the package):

I have a problem with my data set which I'm downloading from a website through an API call. A reproducible problem can be found here.

The data set is a .sav format file (usually opened in SPSS) and contains a character column which haven::read_sav seems to be failing to properly read in.

In fact, running the following code:

library(tidyverse)
library(haven)

dat <- read_sav("problem.sav")

dat %>%
  select(uuid, QB5B_2) %>%
  mutate(nchars = apply(across(QB5B_2), 1, nchar)) %>%
  arrange(-nchars)

gives the following result (I shortened the output a bit for better readability):

# A tibble: 2 x 3
  uuid            QB5B_2                                                     nchars
  <chr>           <chr>                                                       <int>
1 wy3nx3y56cju2t~ Nie wiem, nie było takiej opcji wyboru. [...] błądzą jak d~   608
2 vavwkubw6wzdfg~ Car les GAFA sont des sociétés Américaines [...] elle~        248

However, the second result is wrong. The string is actually 444 characters long. Opening the file in SPSS works fine (it shows 444 characters), and downloading my data in a different format, e.g. csv or xlsx, also works fine and gives the correct results. The problem only occurs when combining a .sav file (which is a requirement in my case) with the read_sav function.

Any ideas what I can do about it?


My data: I'm sharing the data as a raw vector that comes directly from my API pipeline. Please let me know if it would be preferable to get a .sav file directly, e.g. through Google Drive.

library(haven)
library(httr)
library(jsonlite)

download_args <- list(format = unbox("spss16_oe"),
                      cond   = unbox("uuid =='vavwkubw6wzdfg3d' or uuid == 'wy3nx3y56cju2txg'"))

download_request <- POST(url = paste0("MY_URL"),
                         add_headers('x-apikey' = "MY_KEY"),
                         content_type("application/json"),
                         body = toJSON(download_args),
                         encode = "json")

download_content <- content(download_request, encoding = "UTF-8")

writeBin(download_content, con = "problem.sav")
dat <- read_sav("problem.sav")

Data in raw format (identical to what you get as download_content):

download_content <- as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x08, 0x00, 0x08, 
                  0x00, 0xef, 0x45, 0x90, 0x51, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
                  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0a, 0x00, 0x00, 0x00, 0x32, 
                  0x30, 0x31, 0x32, 0x37, 0x31, 0x2e, 0x73, 0x61, 0x76, 0xed, 0x9c, 
                  0xcf, 0x6b, 0x1b, 0x47, 0x14, 0xc7, 0xc7, 0x4e, 0x4b, 0x9a, 0x96, 
                  0x96, 0x10, 0x72, 0x6b, 0x0b, 0x0f, 0x1a, 0x70, 0x02, 0x8e, 0xaa, 
                  0xd8, 0x8e, 0x93, 0x98, 0x84, 0x5a, 0x89, 0xed, 0xd4, 0xc5, 0x38, 
                  0xa9, 0x63, 0xe3, 0xfe, 0xa0, 0x84, 0xd1, 0xee, 0x58, 0x1a, 0x69, 
                  0x35, 0xb3, 0x99, 0xdd, 0xd5, 0x7a, 0xf7, 0xd2, 0x62, 0x08, 0xf9, 
                  0x1b, 0x42, 0x2e, 0x39, 0xc6, 0xff, 0x43, 0x6e, 0xb2, 0xee, 0xfd, 
                  0x1b, 0x7a, 0x29, 0x3d, 0xf5, 0x2f, 0x90, 0xdd, 0x37, 0xab, 0x95, 
                  0xac, 0x04, 0x16, 0x4a, 0xa1, 0x74, 0x53, 0xde, 0xa0, 0x91, 0x76, 
                  0xe7, 0x33, 0xf3, 0xe6, 0xbd, 0x37, 0x33, 0xdf, 0xdd, 0x93, 0x2e, 
                  0xad, 0x6d, 0xcc, 0x2d, 0x5f, 0xfe, 0xe2, 0x0a, 0x3c, 0x7a, 0xf8, 
                  0xe8, 0x11, 0xac, 0xd4, 0xb6, 0x6b, 0xb0, 0xb6, 0xbe, 0xb1, 0x0a, 
                  0x1b, 0xeb, 0x9b, 0x3b, 0xdf, 0xc1, 0xd6, 0xea, 0xca, 0xd7, 0xb5, 
                  0x6d, 0xd8, 0x12, 0x9e, 0xe0, 0x81, 0x80, 0xb9, 0x6a, 0xa5, 0x0a, 
                  0x81, 0x1f, 0x04, 0x52, 0x43, 0x5e, 0xa6, 0x19, 0x63, 0x17, 0xf0, 
                  0x6b, 0x8a, 0x0d, 0xcb, 0x34, 0x1b, 0x95, 0xef, 0x97, 0xaf, 0x2d, 
                  0xc2, 0x8a, 0x70, 0x70, 0x54, 0xf5, 0xe6, 0xd2, 0xc2, 0x8d, 0xa5, 
                  0xf9, 0xaa, 0xb7, 0xb5, 0xf0, 0xe0, 0x71, 0xf2, 0xb8, 0x5a, 0x09, 
                  0x78, 0x17, 0xfe, 0x41, 0xc9, 0xed, 0x9f, 0x67, 0xa7, 0xf3, 0xb1, 
                  0xf3, 0x53, 0x59, 0xdd, 0xd9, 0x59, 0x5f, 0xb1, 0x5d, 0x3e, 0xc3, 
                  0xa6, 0x28, 0x92, 0xee, 0x12, 0x3c, 0xe4, 0x26, 0x94, 0x8e, 0xf4, 
                  0xb9, 0x0a, 0x41, 0xba, 0x42, 0x85, 0x72, 0x4f, 0x0a, 0x63, 0xc7, 
                  0x9f, 0x60, 0x19, 0x0d, 0x9f, 0xfa, 0x7c, 0x2a, 0xab, 0x93, 0xf1, 
                  0xdc, 0x9e, 0xb4, 0x7f, 0x7b, 0x2a, 0xab, 0xbb, 0x39, 0xbf, 0x88, 
                  0x4d, 0xbb, 0x4b, 0x70, 0x8f, 0xfb, 0x61, 0x64, 0x84, 0x0b, 0x5d, 
                  0x6e, 0x24, 0xaf, 0x7b, 0xe2, 0xef, 0xd8, 0x7d, 0xe7, 0xf9, 0x64, 
                  0x5e, 0x4e, 0xa6, 0xb2, 0xfa, 0xed, 0xdd, 0xeb, 0x77, 0x1f, 0xcf, 
                  0x01, 0xbc, 0xc4, 0xa6, 0xe1, 0xf5, 0x12, 0xec, 0x36, 0x13, 0x70, 
                  0xb5, 0xfa, 0xed, 0x97, 0xe7, 0x21, 0x24, 0x3a, 0x82, 0xb0, 0x29, 
                  0x55, 0x1b, 0xbf, 0x79, 0x88, 0x5f, 0x02, 0x56, 0x77, 0x80, 0x2b, 
                  0x37, 0xbb, 0xfc, 0xd1, 0x97, 0xbe, 0x58, 0x82, 0xe6, 0x66, 0x6d, 
                  0x7b, 0xfd, 0xc1, 0x66, 0x6d, 0xe3, 0x27, 0x68, 0xe8, 0xae, 0x30, 
                  0xaa, 0x83, 0xcb, 0x05, 0xdc, 0x08, 0x70, 0x22, 0x63, 0xf0, 0xda, 
                  0x4b, 0x40, 0xab, 0x6c, 0x84, 0x91, 0x8d, 0x66, 0x08, 0x3e, 0x0f, 
                  0x9b, 0x10, 0x37, 0x85, 0x02, 0x19, 0x82, 0xa3, 0x3b, 0x22, 0x80, 
                  0x50, 0x83, 0x11, 0x8d, 0xc8, 0xe3, 0xa1, 0x54, 0x0d, 0x08, 0x85, 
                  0xd3, 0x54, 0xda, 0xd3, 0x8d, 0xc4, 0x62, 0xdc, 0x03, 0x52, 0x04, 
                  0x5f, 0x95, 0x22, 0x87, 0xc4, 0x89, 0x13, 0xff, 0xf7, 0x38, 0x2b, 
                  0xd0, 0xa8, 0x2a, 0x69, 0x14, 0x71, 0xe2, 0xc4, 0x4b, 0xc0, 0x59, 
                  0x81, 0x46, 0x5d, 0x23, 0x8d, 0x22, 0x4e, 0x9c, 0x78, 0x09, 0x38, 
                  0x2b, 0xd0, 0xa8, 0x79, 0xd2, 0x28, 0xe2, 0xc4, 0x89, 0x97, 0x80, 
                  0xb3, 0x02, 0x8d, 0x5a, 0x20, 0x8d, 0x22, 0x4e, 0x9c, 0x78, 0x09, 
                  0x38, 0x2b, 0xd0, 0xa8, 0xeb, 0xa4, 0x51, 0xc4, 0x89, 0x13, 0x2f, 
                  0x01, 0x67, 0x05, 0x1a, 0xb5, 0x48, 0x1a, 0x45, 0x9c, 0x38, 0xf1, 
                  0x12, 0x70, 0x56, 0xa0, 0x51, 0x37, 0x48, 0xa3, 0x88, 0x13, 0x27, 
                  0x5e, 0x02, 0xce, 0x0a, 0x34, 0xea, 0x26, 0x69, 0x14, 0x71, 0xe2, 
                  0xc4, 0x4b, 0xc0, 0x59, 0x81, 0x46, 0xdd, 0x22, 0x8d, 0x22, 0x4e, 
                  0x9c, 0x78, 0x09, 0x38, 0x2b, 0xd0, 0xa8, 0x1a, 0x69, 0x14, 0x71, 
                  0xe2, 0xc4, 0x4b, 0xc0, 0x59, 0x81, 0x46, 0xdd, 0x25, 0x8d, 0x22, 
                  0x4e, 0x9c, 0x78, 0x09, 0x38, 0x2b, 0xd0, 0xa8, 0x7b, 0xa4, 0x51, 
                  0xc4, 0x89, 0x13, 0x2f, 0x01, 0x67, 0x05, 0x1a, 0xb5, 0x42, 0x1a, 
                  0x45, 0x9c, 0x38, 0xf1, 0x12, 0x70, 0x56, 0xa0, 0x51, 0xab, 0xa4, 
                  0x51, 0xc4, 0x89, 0x13, 0x2f, 0x01, 0x67, 0x05, 0x1a, 0xb5, 0x46, 
                  0x1a, 0x45, 0x9c, 0x38, 0xf1, 0xff, 0x9e, 0x2f, 0xb3, 0x09, 0x8d, 
                  0x5a, 0x9e, 0xca, 0x6a, 0xa6, 0x4b, 0xf7, 0x49, 0xa3, 0xca, 0xc0, 
                  0xcf, 0x62, 0xdb, 0x19, 0xac, 0xef, 0x61, 0xfd, 0x00, 0xeb, 0xc5, 
                  0x51, 0x3f, 0xac, 0xbf, 0x4e, 0x0f, 0x7f, 0x6d, 0xb5, 0x76, 0xfe, 
                  0x18, 0x30, 0x76, 0x76, 0xa2, 0xef, 0x99, 0xdc, 0xf6, 0xc9, 0xc9, 
                  0x9f, 0x27, 0xf9, 0xef, 0xcf, 0xa3, 0x7b, 0xdb, 0xef, 0xa3, 0xbc, 
                  0xef, 0xad, 0xdc, 0xc6, 0xf9, 0x09, 0xdb, 0xb7, 0x27, 0xae, 0x47, 
                  0xe0, 0x5d, 0xbf, 0xb6, 0x31, 0x7f, 0x9c, 0xb7, 0x7d, 0x8a, 0xd5, 
                  0xfe, 0x3f, 0xdd, 0x1d, 0xfb, 0xc7, 0x74, 0xe7, 0x76, 0xef, 0xec, 
                  0x9e, 0x1b, 0xee, 0xf4, 0x3b, 0xc3, 0x1f, 0xdb, 0xf5, 0x93, 0xbc, 
                  0xab, 0x1d, 0x92, 0xc3, 0x85, 0xea, 0xad, 0x45, 0x76, 0xee, 0x6c, 
                  0x6e, 0xd2, 0xe6, 0x78, 0x9a, 0x9d, 0x9e, 0x9f, 0x69, 0x76, 0x3a, 
                  0xcf, 0x85, 0xbc, 0x7d, 0x81, 0xe5, 0xff, 0x7d, 0x77, 0x69, 0x79, 
                  0x4b, 0x7b, 0xe2, 0xf2, 0x4c, 0x75, 0xe6, 0xc3, 0x2b, 0x5f, 0xee, 
                  0xbe, 0x79, 0x9b, 0x1f, 0xb2, 0xc9, 0x36, 0x6b, 0xe3, 0x62, 0x6e, 
                  0xe3, 0x7d, 0xeb, 0xeb, 0xf6, 0xda, 0xd5, 0x9b, 0xbf, 0x9f, 0x19, 
                  0x4e, 0x30, 0x18, 0x0c, 0x8e, 0x6d, 0xe9, 0xf2, 0x6e, 0xdc, 0x8e, 
                  0xea, 0xf1, 0x62, 0x9c, 0xba, 0x7b, 0x8d, 0x79, 0x77, 0x2e, 0xdf, 
                  0x34, 0xc7, 0xc7, 0x83, 0xac, 0xdc, 0xe3, 0x06, 0x3c, 0x3c, 0x47, 
                  0xf7, 0x6b, 0x6b, 0x35, 0x08, 0x34, 0x9e, 0x3f, 0x17, 0xef, 0x02, 
                  0xed, 0xc8, 0xde, 0x61, 0xd8, 0x3b, 0x0c, 0xa0, 0xd6, 0xe9, 0x1d, 
                  0x1a, 0xe9, 0x70, 0xa9, 0xb0, 0x5d, 0x84, 0x83, 0xbc, 0x60, 0x37, 
                  0xf0, 0xbd, 0x28, 0xc8, 0x06, 0x37, 0x74, 0x64, 0x8f, 0xaf, 0xb0, 
                  0xe7, 0x37, 0x80, 0x27, 0x91, 0xcc, 0x4e, 0x9f, 0x0e, 0xec, 0x71, 
                  0xf6, 0x66, 0x76, 0x94, 0xc4, 0x53, 0xbc, 0x1a, 0x19, 0xed, 0xf7, 
                  0x0e, 0x85, 0x52, 0x02, 0xf0, 0x33, 0xb2, 0x93, 0x4d, 0xe9, 0xf3, 
                  0x00, 0xdc, 0x19, 0xee, 0x38, 0xda, 0xb8, 0x80, 0x83, 0x50, 0x01, 
                  0x84, 0x67, 0x0d, 0x07, 0x11, 0x7a, 0xc7, 0x21, 0x08, 0x0d, 0x47, 
                  0x67, 0x1a, 0x52, 0x40, 0xef, 0x25, 0x70, 0x57, 0xfb, 0xa1, 0x30, 
                  0xd8, 0x11, 0x1c, 0x31, 0xf6, 0xc7, 0xce, 0x6a, 0x87, 0x38, 0x5a, 
                  0x39, 0xd6, 0x17, 0xa8, 0xc0, 0xea, 0xbe, 0xe8, 0xf8, 0x9e, 0x18, 
                  0xc6, 0xc7, 0xf7, 0x38, 0xce, 0x93, 0x88, 0x3c, 0x42, 0x89, 0xfe, 
                  0xa1, 0xaf, 0x68, 0x63, 0xdd, 0x78, 0xa8, 0x4b, 0x23, 0x3b, 0xa8, 
                  0x3d, 0x59, 0x46, 0x22, 0x83, 0xb3, 0x4b, 0x06, 0xbd, 0x57, 0x0d, 
                  0xeb, 0x87, 0x80, 0xd0, 0xd8, 0x20, 0x6d, 0x40, 0xbd, 0x97, 0x57, 
                  0xeb, 0xe8, 0xf0, 0xb0, 0x1f, 0x84, 0x3c, 0xda, 0x47, 0xe7, 0xad, 
                  0xbd, 0x40, 0x86, 0x72, 0x64, 0x07, 0x23, 0x4e, 0x40, 0x04, 0x28, 
                  0x7d, 0xc6, 0x66, 0x71, 0x8f, 0xcb, 0x3a, 0x76, 0xce, 0x44, 0x09, 
                  0x63, 0x91, 0xdd, 0x2c, 0x57, 0x36, 0x1a, 0x0c, 0x6f, 0xcd, 0x70, 
                  0xf4, 0x19, 0x1d, 0xde, 0x18, 0x5f, 0x77, 0xc7, 0x76, 0x22, 0xd7, 
                  0x70, 0x54, 0xbb, 0x27, 0x11, 0xc6, 0x31, 0x33, 0xf4, 0x15, 0xe5, 
                  0x50, 0x78, 0xbd, 0x57, 0x5d, 0x61, 0x17, 0xec, 0x6d, 0x07, 0xb0, 
                  0x45, 0x84, 0x50, 0x97, 0x18, 0x99, 0xcd, 0xde, 0x20, 0xdf, 0x08, 
                  0xa7, 0x63, 0x31, 0x35, 0x5d, 0x81, 0x43, 0x6c, 0xd2, 0x2b, 0x95, 
                  0xca, 0x78, 0x3f, 0x50, 0xf9, 0x7f, 0x94, 0xe1, 0x82, 0xc7, 0xc9, 
                  0xbc, 0xda, 0x9f, 0x4f, 0xae, 0x2f, 0x3a, 0xad, 0x68, 0x2e, 0xdc, 
                  0x6f, 0xcc, 0x4d, 0xac, 0xb3, 0xdd, 0x13, 0x9b, 0x78, 0x96, 0x62, 
                  0x29, 0x3a, 0xb3, 0x80, 0xcf, 0x48, 0xa8, 0x27, 0xfd, 0x03, 0x8d, 
                  0x3b, 0xa9, 0x2d, 0x45, 0x0b, 0xb4, 0xef, 0xb4, 0xc6, 0xfb, 0x18, 
                  0xe2, 0xa4, 0xae, 0x4d, 0x54, 0x81, 0x6f, 0x84, 0xab, 0x78, 0xdb, 
                  0x3e, 0x82, 0x4d, 0x7a, 0xf4, 0xd4, 0x01, 0x85, 0xc7, 0x49, 0x77, 
                  0x92, 0xa0, 0x7f, 0x80, 0x0f, 0xe9, 0xba, 0x70, 0x94, 0x68, 0xe0, 
                  0xa3, 0x18, 0x91, 0x1b, 0xc1, 0x43, 0xed, 0x05, 0x6d, 0x39, 0x6b, 
                  0x1f, 0xce, 0x23, 0x3b, 0x1d, 0xde, 0x3a, 0x7a, 0x7a, 0x3a, 0xc2, 
                  0x97, 0x01, 0xb7, 0x42, 0xc0, 0xa1, 0xad, 0x3d, 0xfb, 0x98, 0x9e, 
                  0x85, 0x7a, 0xff, 0x00, 0x07, 0xa3, 0x01, 0x68, 0xe1, 0x3c, 0x6e, 
                  0x2a, 0x85, 0x23, 0x21, 0x16, 0xd0, 0x69, 0x78, 0x48, 0xdf, 0xb4, 
                  0x83, 0x03, 0x50, 0x12, 0xf6, 0x70, 0xb3, 0x8b, 0xb1, 0xcd, 0x59, 
                  0x70, 0xf1, 0x2d, 0xc0, 0x7a, 0x91, 0xf2, 0x54, 0x71, 0x27, 0xc5, 
                  0x98, 0x38, 0xc6, 0xd7, 0x7f, 0x2d, 0xb2, 0x18, 0xd3, 0x8e, 0x14, 
                  0x26, 0x1d, 0xd9, 0xc9, 0xcc, 0xa0, 0xf5, 0x18, 0x3b, 0xf5, 0x5f, 
                  0x38, 0x32, 0x4e, 0x3a, 0x80, 0xd1, 0x9b, 0x48, 0xb5, 0xa3, 0x59, 
                  0x68, 0x81, 0xe8, 0xbf, 0xf0, 0x50, 0xc6, 0x9a, 0x1a, 0x1d, 0x01, 
                  0x0d, 0x51, 0xf6, 0x92, 0xa1, 0x63, 0xeb, 0x2a, 0xec, 0x49, 0xd3, 
                  0x19, 0xe7, 0x67, 0xfc, 0xca, 0x21, 0x9d, 0x54, 0x25, 0x4e, 0xb3, 
                  0x82, 0xd1, 0xc7, 0x52, 0x29, 0x89, 0xc1, 0xb5, 0x7c, 0xb4, 0x18, 
                  0x43, 0xea, 0xc9, 0x76, 0x2c, 0x5d, 0x1c, 0x7d, 0xf4, 0x0c, 0xa2, 
                  0x8e, 0x8e, 0x31, 0x63, 0x59, 0x80, 0x98, 0xf4, 0xb1, 0x1d, 0x89, 
                  0xdd, 0x84, 0x23, 0x94, 0xe4, 0xb3, 0xd0, 0xd1, 0x2d, 0x1b, 0x17, 
                  0xa6, 0x46, 0xd8, 0xf9, 0xb2, 0x15, 0x3a, 0x7a, 0x6e, 0x47, 0x00, 
                  0x06, 0x81, 0x46, 0x7c, 0x93, 0x8a, 0x14, 0xa7, 0xce, 0xad, 0xa1, 
                  0x98, 0x8c, 0xec, 0x98, 0x24, 0x8c, 0x0c, 0x2a, 0x49, 0x94, 0xf5, 
                  0xf6, 0x0d, 0x77, 0x22, 0x1b, 0x69, 0xca, 0x71, 0x64, 0xd0, 0x3e, 
                  0x7a, 0x0e, 0x26, 0xe9, 0xbf, 0x8e, 0x86, 0xab, 0x9e, 0xe7, 0x92, 
                  0xc7, 0x28, 0x18, 0x36, 0xe7, 0xda, 0xed, 0x1f, 0x8c, 0x75, 0xa7, 
                  0xff, 0x3a, 0xc1, 0x89, 0x74, 0x90, 0x3a, 0x29, 0x4e, 0xad, 0xb4, 
                  0x4d, 0x52, 0xb6, 0xe8, 0x26, 0x4d, 0x82, 0x14, 0xd0, 0xf5, 0xfe, 
                  0x8b, 0xa3, 0x67, 0x95, 0x7c, 0xad, 0x87, 0x2b, 0x6f, 0x17, 0xbc, 
                  0xce, 0x8d, 0x9b, 0x9e, 0xae, 0x7b, 0x6a, 0x57, 0x3b, 0x83, 0x15, 
                  0xf8, 0x41, 0x49, 0x6b, 0x2e, 0xc1, 0xc4, 0xb6, 0x0d, 0x6f, 0x61, 
                  0xc8, 0x75, 0xa3, 0xec, 0x32, 0xd8, 0x71, 0x2e, 0xf7, 0x44, 0xcb, 
                  0x6e, 0x33, 0xab, 0x99, 0x1a, 0x53, 0xe0, 0x28, 0x8d, 0x5e, 0x0e, 
                  0xad, 0x1c, 0x1f, 0x0b, 0x9b, 0x0e, 0x74, 0xd7, 0x11, 0xe3, 0xe0, 
                  0xea, 0xf6, 0x1e, 0x45, 0x1a, 0x83, 0x1a, 0x3a, 0x95, 0xf9, 0xe4, 
                  0xc8, 0x0a, 0xe9, 0xd9, 0xbb, 0x59, 0xec, 0x7b, 0xcb, 0x5f, 0x50, 
                  0x4b, 0x07, 0x08, 0xd3, 0xaf, 0x46, 0xb6, 0xe4, 0x05, 0x00, 0x00, 
                  0xd9, 0x58, 0x00, 0x00, 0x50, 0x4b, 0x01, 0x02, 0x14, 0x03, 0x14, 
                  0x00, 0x08, 0x00, 0x08, 0x00, 0xef, 0x45, 0x90, 0x51, 0xd3, 0xaf, 
                  0x46, 0xb6, 0xe4, 0x05, 0x00, 0x00, 0xd9, 0x58, 0x00, 0x00, 0x0a, 
                  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
                  0x80, 0x01, 0x00, 0x00, 0x00, 0x00, 0x32, 0x30, 0x31, 0x32, 0x37, 
                  0x31, 0x2e, 0x73, 0x61, 0x76, 0x50, 0x4b, 0x05, 0x06, 0x00, 0x00, 
                  0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x38, 0x00, 0x00, 0x00, 0x1c, 
                  0x06, 0x00, 0x00, 0x00, 0x00))
@evanmiller
Collaborator

Hi, poking around the data it looks like the SAV file has encoded a null byte in the middle of the problematic string (after "car leurs si"). ReadStat / Haven always interprets null bytes as the end of the string, but it sounds like SPSS will accept (or discard?) the null byte and use the rest of the string.

Do you see any special indicator between the phrases "car leurs si" and " èges se trouvent" when opening the file in SPSS?
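
To illustrate the effect described above, here is a minimal base R sketch. It is not ReadStat's code, and the bytes are a made-up stand-in for the real field (ASCII only, so the accented character is left out). It shows how a reader that stops at the first null byte returns only the leading part of the stored value:

```r
# Hypothetical field contents: a null byte embedded after "car leurs si".
prefix <- charToRaw("car leurs si")
suffix <- charToRaw("eges se trouvent")
bytes  <- c(prefix, as.raw(0x00), suffix)

# C-style reading: everything from the first null byte onwards is dropped.
first_nul <- which(bytes == as.raw(0x00))[1]
c_style   <- rawToChar(bytes[seq_len(first_nul - 1)])

c_style         # "car leurs si"
nchar(c_style)  # 12, although 29 bytes are stored in the field
```

This would match the symptom in the report, where the string is cut short instead of being garbled.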

@deschen1
Author

Thanks @evanmiller for looking into it. I don't see anything suspicious in that response between these two parts of the string. Of course, I can only check this to some extent, because in SPSS I can only look at the visual output and can't inspect the encoding or the byte structure in the background.

The important question is: what would be the correct behaviour? Is it wrong for the file to contain such a null byte, or is it valid, in which case haven should accept it?

Interestingly (and I've closed this post in favour of this new one), at some point after I did some recoding of the respective character column, even opening the file in SPSS led to SPSS automatically splitting the column into several ones after 255 characters (and the same happened when opening in R with read_sav): #559

@evanmiller
Collaborator

The null byte would need to be stripped in ReadStat somehow, because C regards null bytes as string terminators. So we'd need an extra step in ReadStat to scan the strings for internal null bytes before handing a clean C string to haven.
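
As a rough sketch of that extra step, written in R purely for illustration (the actual change would live in ReadStat's C code, and `strip_embedded_nuls` is a hypothetical helper name), the idea is to drop the null bytes from the raw field before converting it to a string:

```r
# Hypothetical helper: remove every null byte, then build the string.
# rawToChar() refuses input that contains an embedded nul, so the
# filtering has to happen on the raw vector first.
strip_embedded_nuls <- function(bytes) {
  rawToChar(bytes[bytes != as.raw(0x00)])
}

# Made-up example field with one embedded null byte.
field <- c(charToRaw("car leurs si"), as.raw(0x00), charToRaw("eges se trouvent"))

cleaned <- strip_embedded_nuls(field)
cleaned         # "car leurs sieges se trouvent"
nchar(cleaned)  # 28
```

Whether null bytes should be dropped or replaced (for example with a space) is a design choice; this sketch simply drops them.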

Regarding the column-splitting issue, #547 may be related. The SAV format splits large columns internally and assigns them numbers on top of a five-character prefix. ReadStat may not mirror this logic exactly.

@deschen1
Author

deschen1 commented Dec 21, 2020

Thanks a lot. Sounds promising and indeed a feature (if not bug?) that would require fixing in haven?

As for #547 I saw that too, but I'm skeptical if it is related, because I have several other character columns that seem to work without problems. But I'll check if it really could be related to this issue. Thinking about it, most (if not all) requirements mentioned in #547 seem to be true for my case as well.

Update: I can confirm that issue #547 is present in my data set as well. My original variables are named "QB5B_1" and "QB5B_2". After renaming them and reopening the file with read_sav, the columns no longer seem to be split.

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Dec 21, 2020
@evanmiller
Collaborator

The null byte issue should be fixed in WizardMac/ReadStat@7b4357d. It may take a while to reach haven.

We can continue the column-splitting discussion over at #547.

@hadley
Member

hadley commented Apr 8, 2021

Just updated readstat, so this should now be fixed.

@hadley hadley closed this as completed Apr 8, 2021