read_spss() doesn’t handle non-ASCII characters in variable names properly #36

Closed
huftis opened this Issue Mar 5, 2015 · 10 comments

huftis commented Mar 5, 2015

I have an example SPSS file with a few non-ASCII characters. When I import it using read_spss(), the non-ASCII characters in variable values are handled properly, but the same characters in variable names are not converted. They look like UTF-8 byte sequences interpreted as ISO-8859-1. If you supply an e-mail address, I can mail you the example file (the GitHub issue tracker doesn’t seem to support attachments).

Example R session:

> library(haven)
> d=read_spss("spsstest.sav")
> d # Wrong characters in the column header
  abc abcÃ¦Ã¸Ã¥ testÂµ
1 foo         1     10
2 bår         2     20
3 æøå         3     30

It seems easy enough to fix:

> Encoding(names(d))
[1] "unknown" "unknown" "unknown"
> Encoding(names(d))="UTF-8"
> d  # Correct characters in the column header
  abc abcæøå testµ
1 foo      1    10
2 bår      2    20
3 æøå      3    30
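
The workaround above can be reproduced without the SPSS file. The following is a minimal sketch, assuming the column name arrives as UTF-8 bytes with no declared encoding (which is what the "unknown" result suggests). The variable names `name`, `garbled` and `fixed` are made up for illustration.

```r
# UTF-8 bytes for "abcæøå", built from raw values so that the example does not
# depend on the encoding of this script or on the current locale.
name <- rawToChar(as.raw(c(0x61, 0x62, 0x63, 0xc3, 0xa6, 0xc3, 0xb8, 0xc3, 0xa5)))

# Reading the same bytes as Latin-1 reproduces the garbled header:
# each two-byte UTF-8 sequence turns into two separate characters.
garbled <- iconv(name, from = "latin1", to = "UTF-8")

# Declaring the encoding does not change any bytes; it only tells R how to
# interpret them.
fixed <- name
Encoding(fixed) <- "UTF-8"

nchar(garbled)  # 9 characters
nchar(fixed)    # 6 characters
```

Note that `Encoding<-` only helps when the bytes really are UTF-8. For a file in another encoding the bytes have to be converted, not just relabelled.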

The above example was for an SPSS file saved as ‘Unicode’. If I instead save it in the ‘native’ encoding (which seems to be Windows-1252), I get this error message:

> d=read_spss("spsstest2.sav")
Failed to find ABCÆØÅ

Failed to find TESTµ

The resulting data.frame looks like this:

> d
  abc ABCÆØÅ TESTµ
1 foo      1    10
2 bår      2    20
3 æøå      3    30

Note that all the variable names are lowercase in the original SPSS file (i.e., abc, abcæøå and testµ), but two of them have been converted to uppercase in the data.frame.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252 
[2] LC_CTYPE=Norwegian-Nynorsk_Norway.1252   
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C                             
[5] LC_TIME=Norwegian-Nynorsk_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_0.1.1

loaded via a namespace (and not attached):
[1] Rcpp_0.11.4 tools_3.1.1

sjPlot commented Mar 5, 2015

See https://github.com/hadley for a contact e-mail address.

evanmiller commented Mar 5, 2015

The "Failed to find" errors come from my end of the code (the ReadStat library). I think I have tracked down the problem. I believe that the uppercasing issue is a result of this. I'll push some code in a minute for you to test.

evanmiller commented Mar 5, 2015

This commit should fix the uppercasing and "Failed to find" errors:

WizardMac/ReadStat@5eccc7b

See the haven README for instructions on testing this code.

huftis commented Mar 5, 2015

I couldn’t get the README instructions for updating ReadStat to work, but I think I have managed to install the latest GitHub version manually. It didn’t solve the problem, though.

I have sent the two SPSS sample files by e-mail to Hadley.

While my original report was from a Windows system, I have now tried reading the files on a Linux system. For the first file, things work fine (even with the older version of haven). The reason is probably that my Linux system has a UTF-8 locale, and all non-ASCII characters are automatically interpreted as UTF-8 (while on Windows they were interpreted as Windows-1252). (Encoding(names(d)) still returns "unknown" values.)

For the second file, I still get warning messages, but they’re slightly different (probably because the byte sequences are interpreted as UTF-8). The resulting variable names are wrong:

> d=read_spss("~/tmp/spsstest2.sav")
Failed to find ABCÆØÅ

Failed to find TESTµ

> d
  abc ABC\xc6\xd8\xc5 TEST\xb5
1 foo               1       10
2 bår               2       20
3 æøå               3       30

In both example files, it looks like the problem is that variable names are read as being in the user’s locale (e.g., Windows-1252 on Windows, UTF-8 on Linux). When the encoding in the SPSS file doesn’t match this, the variable names used are wrong.
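
Until the reader converts names itself, a user-side workaround might be to convert them explicitly from the file’s encoding. This is a minimal sketch, assuming the second file is in Windows-1252 (as guessed above) and that the local iconv accepts the name "CP1252". It is not what haven or ReadStat does internally, and `raw_names` only imitates the bytes shown in the Linux output.

```r
# Column names as in the Linux output above: Windows-1252 bytes that are not
# valid UTF-8. Built from raw values so the script itself stays pure ASCII.
raw_names <- c(
  "abc",
  rawToChar(c(charToRaw("ABC"), as.raw(c(0xc6, 0xd8, 0xc5)))),
  rawToChar(c(charToRaw("TEST"), as.raw(0xb5)))
)

# Convert from the encoding the file was written in to UTF-8.
# The encoding name is an assumption; check iconvlist() on your system.
fixed_names <- iconv(raw_names, from = "CP1252", to = "UTF-8")

Encoding(fixed_names)  # "unknown" for the pure ASCII name, "UTF-8" for the others
```

With a data.frame `d` read from such a file, the same call could be applied as `names(d) <- iconv(names(d), from = "CP1252", to = "UTF-8")`. This does not undo the uppercasing, which is a separate problem.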

huftis commented Mar 6, 2015

I have now uploaded the two test files to http://huftis.org/nedlasting/spss/

evanmiller commented Mar 6, 2015

Thanks. I am able to reproduce this issue. I'll let you know if I make progress on it.

evanmiller commented Mar 6, 2015

Ok, I've been able to track down the issue. It looks like the character encoding information appears after the variable list in the file, so the variable names aren't being properly converted. I'll need to rework some things to fix the issue properly.
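
The ordering problem described here can be illustrated with a toy model. This is only a sketch of the general idea (hold the raw name bytes until the encoding is known, then convert), not ReadStat’s actual code or the real .sav layout. The record structure and the function `decode_names()` are hypothetical.

```r
# Hypothetical record stream: variable names come first, the encoding last.
records <- list(
  list(type = "variable", name = as.raw(c(0x61, 0x62, 0x63))),
  list(type = "variable", name = as.raw(c(0x61, 0x62, 0x63, 0xe6, 0xf8, 0xe5))),
  list(type = "encoding", value = "CP1252")
)

decode_names <- function(records, default = "UTF-8") {
  pending <- list()     # raw name bytes, kept unconverted for now
  encoding <- default   # used if the stream has no encoding record
  for (rec in records) {
    if (rec$type == "variable") {
      pending[[length(pending) + 1]] <- rec$name
    } else if (rec$type == "encoding") {
      encoding <- rec$value
    }
  }
  # Convert only after the whole stream has been seen.
  vapply(
    pending,
    function(bytes) iconv(rawToChar(bytes), from = encoding, to = "UTF-8"),
    character(1)
  )
}

var_names <- decode_names(records)
```

Converting each name as soon as it is read would have to guess the encoding, which matches the locale-dependent behaviour reported earlier in this thread.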

evanmiller commented Mar 6, 2015

The latest ReadStat commits (WizardMac/ReadStat@c93cb10 and WizardMac/ReadStat@d28b856) should fix this issue. I tested against your test file and everything seems to work OK.

hadley commented Mar 6, 2015

And haven is updated, so give the dev version another go.

huftis commented Mar 6, 2015

I’m not able to test this on Windows, but on Linux everything now works fine. Thank you so much for fixing this!

huftis closed this Mar 6, 2015

lock bot locked and limited conversation to collaborators Jun 27, 2018