read_spss() doesn’t handle non-ASCII characters in variable names properly #36
I have an example SPSS file with a few non-ASCII characters. When I import this using read_spss(), the non-ASCII characters in variable values are handled properly, but the same characters in variable names are not converted. They look like UTF-8 byte sequences that have been interpreted as ISO-8859-1. If you supply an e-mail address, I can mail you the example file (the GitHub issue tracker doesn’t seem to support attachments).
Example R session:
It seems easy enough to fix:
The above example was for an SPSS file saved as ‘Unicode’. If I instead save it in the ‘native’ encoding (which seems to be Windows-1252), I get this error message:
The resulting data.frame looks like this:
Note that all the variable names are lowercase in the original SPSS file (i.e.,
I couldn’t get the README instructions for updating ReadStat to work, but I think I have managed to install the latest GitHub version manually. It didn’t solve the problem, though.
I have sent the two SPSS sample files by e-mail to Hadley.
While my original report was from a Windows system, I have now tried reading the files on a Linux system. For the first file, things work fine (even with the older version of haven). The reason is probably that my Linux system has a UTF-8 locale, so all non-ASCII characters are automatically interpreted as UTF-8 (while on Windows they were interpreted as Windows-1252). (Encoding(names(d)) still returns "unknown" for every name.)
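The locale effect described above can be reproduced without the SPSS file. The following is only a minimal sketch: the variable name "kjønn" is a made-up example, and it assumes that a Unicode SPSS file stores names as UTF-8 bytes. It decodes the same bytes once as Windows-1252 and once as UTF-8.

```r
# Hypothetical variable name "kjønn", written with an escape so the
# script itself stays ASCII-only.
name_utf8 <- "kj\u00f8nn"

# The raw bytes as they would sit in the file (assumption: UTF-8 storage).
bytes <- charToRaw(name_utf8)

# Decoding the bytes as Windows-1252 gives the garbled form seen on Windows.
as_cp1252 <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")

# Decoding the same bytes as UTF-8 recovers the intended name.
as_utf8 <- iconv(rawToChar(bytes), from = "UTF-8", to = "UTF-8")

print(as_cp1252)
print(as_utf8)
```

The garbled result has one character more than the original, because the two-byte UTF-8 sequence for "ø" is read as two separate Windows-1252 characters.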
For the second file, I still get a warning message, but it’s slightly different (probably because the byte sequences are interpreted as UTF-8). The resulting variable names are wrong:
In both example files, it looks like the problem is that variable names are read as being in the user’s locale (e.g., Windows-1252 on Windows, UTF-8 on Linux). When the encoding in the SPSS file doesn’t match this, the resulting variable names are wrong.
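If that diagnosis is right, a user-side workaround would be to convert the names from the file's encoding to UTF-8 after import. The sketch below is an assumption-laden illustration, not haven's behaviour: `fix_names()` is a hypothetical helper, the byte vectors simulate names that arrive as unconverted bytes, and "CP1252" is assumed to be an encoding name that the platform's iconv accepts.

```r
# Hypothetical helper (not part of haven): interpret the raw bytes of the
# names as `file_encoding` and convert them to UTF-8.
fix_names <- function(nms, file_encoding) {
  iconv(nms, from = file_encoding, to = "UTF-8")
}

# Simulated names for the made-up variable "kjønn", as raw bytes with no
# declared encoding.
from_unicode_file <- rawToChar(as.raw(c(0x6b, 0x6a, 0xc3, 0xb8, 0x6e, 0x6e)))  # UTF-8 bytes
from_native_file  <- rawToChar(as.raw(c(0x6b, 0x6a, 0xf8, 0x6e, 0x6e)))        # Windows-1252 bytes

fixed_unicode <- fix_names(from_unicode_file, "UTF-8")
fixed_native  <- fix_names(from_native_file, "CP1252")

# Both results are now marked as UTF-8 instead of "unknown".
print(Encoding(c(fixed_unicode, fixed_native)))
```

With a real import this would be applied as `names(d) <- fix_names(names(d), "UTF-8")`, which only helps if the imported names still contain the original bytes and the file's encoding is known.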