Stata (before 13) needs optional encoding parameter #163

Closed
huashan opened this Issue May 22, 2016 · 5 comments

huashan commented May 22, 2016

If a version 13 dta file has labels defined in a non-UTF-8 encoding, version 0.2.0.9000 fails to recognize the variable labels.

I think these lines in readstat_dta.c should be commented out. Alternatively, should read_dta() take one more parameter for specifying a default encoding?

if (ds_format < 118) {
    ctx->converter = iconv_open("UTF-8", "WINDOWS-1252");
}

Contributor

evanmiller commented May 22, 2016

Hi,

What encoding do your labels use?

huashan commented May 22, 2016

@evanmiller
Thanks for the reply. The encoding is GB2312. I also tested a version 14 dta file and got the same result as above.
With those three lines commented out, I could use Encoding(x)<- in R to set the correct encoding for the variable labels.

Contributor

evanmiller commented May 22, 2016

@hadley I've extended the C API to allow manual specification of the file encoding. The trouble is that pre-14 Stata uses the system encoding (usually Win 1252) but does not indicate what that encoding is anywhere in the file. For kicks I also allow specifying the output encoding, which defaults to UTF-8. Here's the API diff from WizardMac/ReadStat@c4e0d48:

// Usually inferred from the file, but sometimes a manual override is desirable.
// In particular, pre-14 Stata uses the system encoding, which is usually Win 1252
// but could be anything. `encoding' should be an iconv-compatible name.
readstat_error_t readstat_set_input_character_encoding(readstat_parser_t *parser, const char *encoding);

// Defaults to UTF-8. Pass in NULL to disable transliteration.
readstat_error_t readstat_set_output_character_encoding(readstat_parser_t *parser, const char *encoding);
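To illustrate why the input encoding matters, here is a minimal sketch that calls POSIX iconv directly rather than going through ReadStat. The helper name `to_utf8` and its signature are assumptions made for this example and are not part of the ReadStat API. It converts a byte string from a caller-supplied, iconv-compatible encoding name to UTF-8.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of ReadStat: convert the NUL-terminated
 * bytes in `src` from the iconv-compatible encoding `from` to UTF-8 and
 * write a NUL-terminated result into `dst`.
 * Returns 0 on success, -1 on failure. */
int to_utf8(const char *from, const char *src, char *dst, size_t dst_len) {
    assert(from != NULL && src != NULL && dst != NULL);
    if (dst_len == 0) return -1;

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) return -1;   /* encoding name not supported */

    char *in = (char *)src;             /* iconv takes a non-const pointer */
    size_t in_left = strlen(src);
    char *out = dst;
    size_t out_left = dst_len - 1;      /* reserve room for the terminator */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;    /* invalid input or buffer too small */

    *out = '\0';
    return 0;
}
```

Feeding the GB2312 bytes D6 D0 CE C4 through this helper with "WINDOWS-1252" as the source encoding yields the mojibake ÖÐÎÄ, while "GB2312" yields 中文. That is the kind of difference a manual input-encoding override is meant to address.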

hadley changed the title from "stata labels with non UTF-8 encoding" to "Stata (before 13) needs optional encoding parameter" on May 30, 2016

Member

hadley commented May 30, 2016

@evanmiller is it safe to default this to UTF-8? Or does Stata 14 use different encodings, so that I need to respect the one used for the file?

hadley closed this in 6424528 on May 30, 2016

Member

hadley commented May 30, 2016

@evanmiller I've assumed it should probably only be set when the user specifically wants an override.
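The "override only when asked" rule discussed in this thread could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from either repository: the function name `choose_input_encoding` is hypothetical, the `ds_format < 118` threshold and the WINDOWS-1252 fallback are taken from the snippet quoted in the first comment, and letting a user override win for every file version is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not ReadStat's actual code: pick the input encoding
 * for a dta file. `user_encoding` is an optional override; NULL or ""
 * means "not set". Returns an iconv-compatible encoding name, or NULL when
 * no conversion is attempted because the file is taken to be UTF-8. */
const char *choose_input_encoding(int ds_format, const char *user_encoding) {
    assert(ds_format > 0);

    /* Assumption: an explicit override always takes precedence. */
    if (user_encoding != NULL && user_encoding[0] != '\0')
        return user_encoding;

    /* Older formats do not record their encoding, so fall back to the
     * guess used in the snippet quoted earlier in this thread. */
    if (ds_format < 118)
        return "WINDOWS-1252";

    /* Format 118 and later: treat the contents as UTF-8 already. */
    return NULL;
}
```

With this shape, a caller who knows the labels are GB2312 passes that name explicitly, and everyone else gets the existing default behavior.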

lock bot locked and limited conversation to collaborators on Jun 27, 2018
