write_sav fails when character variables contain special characters #258

Closed
vidhjel opened this issue Jan 6, 2017 · 14 comments
Labels
bug an unexpected problem or unintended behavior readstat

Comments

@vidhjel

vidhjel commented Jan 6, 2017

library(haven)

# Two-row data frame; the second string contains non-ASCII characters (å, §)
x <- data.frame(
  V1 = c("a", "b"),
  V2 = c("Normalresepter",
         "Blåreseptordningen §§ 2, 3a, 3b, 4 og 5 (gammel ordning §§ 2, 3, 4, 9, og 10a)"),
  stringsAsFactors = FALSE
)
# Truncate column 2 one character at a time and try to write each version
for (j in max(nchar(x[, 2])):1) {
  cat(j, "\n")
  x[, 2] <- substr(x[, 2], 1, j)
  cat(x[2, 2], "\n")
  tmp <- try(write_sav(x, "test.sav"))
}

The above code yields the error message below when x[2,2] is truncated to its first 8, 16, 22-24, 30-32, 38-40, 54-56, 60-64, 68-72, or 76-78 characters. Otherwise it works fine.
Error in write_sav_(data, normalizePath(path, mustWork = FALSE)) :
Writing failure: A provided string value was longer than the available storage size of the specified column.

@jllipatz

A simpler example with write_sas:

str(a2)
'data.frame': 1 obs. of 2 variables:
$ SUFFIXE : chr ""
$ COMPLEMENT: chr "14, 16 et 18= même immeuble"
write_sas(a2,"U:/R/Test/bsa86a.sas7bdat")
Error in eval(substitute(expr), envir, enclos) :
Writing failure: A provided string value was longer than the available storage size of the specified column.

@hadley
Member

hadley commented Jan 25, 2017

This works for me:

df <- tibble::tibble(x = c("Normalre", "Blåresep"))
write_sas(df, tempfile())
write_sav(df, tempfile())

But it's possibly because I'm on a mac. Could you please see if that works for you?

Otherwise, a reprex in that style would be much appreciated.

hadley added the "reprex" (needs a minimal reproducible example) label on Jan 25, 2017
@vidhjel
Author

vidhjel commented Jan 26, 2017

Thank you for your suggestion. Unfortunately, it doesn't work:

> library(haven)
> df <- tibble::tibble(x = c("Normalre", "Blåresep"))
> write_sas(df, tempfile())
Error in write_sas_(data, normalizePath(path, mustWork = FALSE)) : 
  Writing failure: A provided string value was longer than the available storage size of the specified column.

By the way,

df <- tibble::tibble(y=c("a","b"),x = c("Normalre", "Blårese")) # Works fine
df <- tibble::tibble(y=c("a","b"),x = c("Normalr", "Blårese")) # Yields error message
df <- tibble::tibble(y=c("a","b"),x = c("Normalresept", "Blåresept")) # Works fine 
df <- tibble::tibble(y=c("a","b"),x = c("Normalresep", "Blåresept")) # Works fine
df <- tibble::tibble(y=c("a","b"),x = c("Normalrese", "Blåresept")) # Works fine
df <- tibble::tibble(y=c("a","b"),x = c("Normalres", "Blåresept")) # Yields error message

@hadley
Member

hadley commented Jan 26, 2017

GitHub issues use markdown and they're much easier to read if you learn a little bit about it. The most important thing is to put your R code inside a block that starts with ```R and ends with ```. I've done that for you here, but I'd really appreciate it if you would format it yourself in the future.

@hadley
Member

hadley commented Jan 26, 2017

Can you please also work on making your reprex minimal, like mine? I have lots of examples that work - I just need to have one example that I can easily copy and paste into R that demonstrates the problem.

@evanmiller
Collaborator

evanmiller commented Jan 26, 2017 via email

hadley added the "bug" (an unexpected problem or unintended behavior) and "readstat" labels and removed the "reprex" (needs a minimal reproducible example) label on Jan 27, 2017
@huftis
Contributor

huftis commented Jul 31, 2017

I can confirm that this bug is still present in the latest version of haven (1.1.0) on Windows, and it makes it impossible to write SPSS files with non-ASCII characters. Here’s a reprex:

library(haven)
df <- tibble::tibble(x = "Blaresep", y = "Blåresep")
write_sav(df[,1], tempfile()) # This works
write_sav(df[,2], tempfile()) # This doesn’t
# Error in write_sas_(data, normalizePath(path, mustWork = FALSE)) : 
# Writing failure: A provided string value was longer than the available storage size of the specified column.

devtools::session_info()

Session info -----------------------------------
 setting  value                        
 version  R version 3.4.0 (2017-04-21) 
 system   x86_64, mingw32              
 ui       RStudio (1.0.143)            
 language (EN)                         
 collate  Norwegian-Nynorsk_Norway.1252
 tz       Europe/Berlin                
 date     2017-07-31                   

Packages ---------------------------------------
 package    * version date       source        
 backports    1.1.0   2017-05-22 CRAN (R 3.4.0)
 base       * 3.4.0   2017-04-21 local         
 callr        1.0.0   2016-06-18 CRAN (R 3.4.1)
 clipr        0.3.3   2017-06-19 CRAN (R 3.4.1)
 compiler     3.4.0   2017-04-21 local         
 datasets   * 3.4.0   2017-04-21 local         
 devtools     1.13.2  2017-06-02 CRAN (R 3.4.0)
 digest       0.6.12  2017-01-27 CRAN (R 3.4.0)
 evaluate     0.10.1  2017-06-24 CRAN (R 3.4.0)
 forcats      0.2.0   2017-01-23 CRAN (R 3.4.0)
 graphics   * 3.4.0   2017-04-21 local         
 grDevices  * 3.4.0   2017-04-21 local         
 haven      * 1.1.0   2017-07-09 CRAN (R 3.4.1)
 htmltools    0.3.6   2017-04-28 CRAN (R 3.4.0)
 knitr        1.16    2017-05-18 CRAN (R 3.4.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.4.0)
 memoise      1.1.0   2017-04-21 CRAN (R 3.4.0)
 methods    * 3.4.0   2017-04-21 local         
 Rcpp         0.12.12 2017-07-15 CRAN (R 3.4.1)
 reprex     * 0.1.1   2017-01-13 CRAN (R 3.4.1)
 rlang        0.1.1   2017-05-18 CRAN (R 3.4.0)
 rmarkdown    1.6     2017-06-15 CRAN (R 3.4.0)
 rprojroot    1.2     2017-01-16 CRAN (R 3.4.0)
 rstudioapi   0.6     2016-06-27 CRAN (R 3.4.0)
 stats      * 3.4.0   2017-04-21 local         
 stringi      1.1.5   2017-04-07 CRAN (R 3.4.0)
 stringr      1.2.0   2017-02-18 CRAN (R 3.4.0)
 tibble       1.3.3   2017-05-28 CRAN (R 3.4.0)
 tools        3.4.0   2017-04-21 local         
 utils      * 3.4.0   2017-04-21 local         
 whisker      0.3-2   2013-04-28 CRAN (R 3.4.0)
 withr        1.0.2   2016-06-20 CRAN (R 3.4.0)

@huftis
Contributor

huftis commented Jul 31, 2017

Note that even when write_sav() apparently succeeds (i.e. there is no error message), the resulting file may not contain the correct data (example: write_sav(tibble(x="øø"), "test.sav") – the x variable has the string ø, not øø when opened in SPSS).

I think an easy way to fix this (or a workaround, for people having this problem) is to re-encode all strings to UTF-8 before saving them (df[] = lapply(df, function(x) iconv(x, to="UTF-8"))).
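One caveat with applying iconv() to every column is that it converts non-character columns (numbers, for instance) to character. A minimal sketch of a narrower variant follows. It assumes the data frame holds ordinary character columns, and it swaps in base R's enc2utf8() in place of the iconv() call. The helper name to_utf8 is hypothetical and not part of haven.

```r
# Hypothetical helper (not part of haven): re-encode only the character
# columns to UTF-8 and leave every other column untouched.
to_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], enc2utf8)
  }
  df
}

df <- data.frame(
  id = 1:2,
  x = c("Normalre", "Bl\u00e5resep"),  # "\u00e5" is the letter å
  stringsAsFactors = FALSE
)
df <- to_utf8(df)
str(df)

# The converted data frame could then be written as usual, e.g.:
# haven::write_sav(df, tempfile(fileext = ".sav"))
```

Whether this avoids the error on a given haven version is untested here; it only ensures the strings are UTF-8 before they are handed to the writer.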

@hadley
Member

hadley commented Jan 7, 2018

@evanmiller any thoughts on whether this is a haven or a readstat problem? It's possible I've missed a conversion to utf-8 somewhere

@evanmiller
Collaborator

@hadley Based on @huftis's comment it sounds like a haven problem to me.

@hadley
Member

hadley commented Jan 16, 2018

@evanmiller Either I'm missing something obvious, I have misunderstood the readstat API, or there's a readstat bug.

  1. readstat_insert_string_value() is called from my C++ insertValue() at https://github.com/tidyverse/haven/blob/master/src/DfWriter.cpp#L298-L304

  2. insertValue() is called from https://github.com/tidyverse/haven/blob/master/src/DfWriter.cpp#L143 (I verified this is the line executed in @huftis's example by inserting a print statement), which uses string_utf8()

  3. string_utf8() calls the R API, ensuring that the result is always encoded as UTF-8: https://github.com/tidyverse/haven/blob/master/src/DfWriter.cpp#L9-L11

I don't see a way to set the encoding for the output, so I'm assuming utf-8, but maybe that assumption is wrong?

@evanmiller
Collaborator

I believe the bug is here:

https://github.com/tidyverse/haven/blob/master/src/DfWriter.cpp#L266

Should be measuring UTF-8 length on this line.
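To illustrate the point in plain R (a sketch only, not the C++ code in DfWriter.cpp): for a string with non-ASCII characters, the character count and the UTF-8 byte count differ, so a column width taken from the character count can be too small for the bytes that are actually written.

```r
# Two 8-character strings; the second contains å ("\u00e5"),
# which occupies two bytes in UTF-8.
x <- enc2utf8(c("Blaresep", "Bl\u00e5resep"))

n_chars <- nchar(x, type = "chars")  # number of characters: 8 8
n_bytes <- nchar(x, type = "bytes")  # number of UTF-8 bytes: 8 9

print(data.frame(x = x, chars = n_chars, bytes = n_bytes))
```

This matches the reprex above: "Blaresep" is 8 bytes long, while "Blåresep" needs 9 bytes once encoded as UTF-8.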

@hadley
Member

hadley commented Jan 16, 2018

Oh, I'm an idiot, thanks!

@hadley hadley closed this as completed in e088e50 Jan 16, 2018
@lock

lock bot commented Jul 15, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jul 15, 2018