write_dta not working for data.frame which columns contain unicode characters #383

withr opened this issue Jun 18, 2018 · 3 comments



@withr withr commented Jun 18, 2018

Starting with version 14, Stata supports Unicode letters in column (variable) names. Legal column names can therefore contain _, 0-9, and Unicode letters (not only Latin characters).

However, the code in haven.R that validates the names only accepts ASCII letters:

bad_names <- !grepl("^[A-Za-z_]{1}[A-Za-z0-9_]{0,31}$", names(data))

This is not correct for newer files. The check should take the version into account; for version >= 14 it could use the following code:

bad_names <- !stringi::stri_detect_regex(names(data), "^[\\p{L}_]{1}[\\p{L}0-9_]{0,31}$")

However, validate_dta is not the only function that validates the column names.

The function 'dta_validate_name' in readstat_dta_write.c also checks the column names. I tried commenting out these lines:

int j;
for (j=0; name[j]; j++) {
    if (name[j] != '_' &&
            !(name[j] >= 'a' && name[j] <= 'z') &&
            !(name[j] >= 'A' && name[j] <= 'Z') &&
            !(name[j] >= '0' && name[j] <= '9')) {
char first_char = name[0];
if (first_char != '_' &&
        !(first_char >= 'a' && first_char <= 'z') &&
        !(first_char >= 'A' && first_char <= 'Z')) {

It seems to work, but I am not sure, given my limited experience in C.


@hadley hadley commented Jun 20, 2018

@evanmiller can you update the variable name check?

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Jun 20, 2018
Newer DTA allows Unicode characters of the Letter character class
to appear in column names. Proper validation will require some kind
of Unicode library, so in the meantime just skip the check for multi-
byte characters. (I.e. ASCII characters will continue to be validated)

See tidyverse/haven#383

@evanmiller evanmiller commented Jun 20, 2018

It's not a full solution, but I've relaxed the code for DTA 118 and later.
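
For readers following along, the relaxation described in the commit message above might look roughly like the sketch below. It is not ReadStat's actual code: `name_is_valid`, `is_multibyte`, and the `allow_unicode` flag are hypothetical names used only for illustration, and the sketch assumes the name is UTF-8 encoded. ASCII bytes are still validated, while any byte with the high bit set (part of a multi-byte UTF-8 sequence) is let through unchecked. Length limits are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if c belongs to a multi-byte UTF-8 sequence.
 * Every lead and continuation byte of such a sequence has the high bit set. */
static bool is_multibyte(unsigned char c) {
    return (c & 0x80) != 0;
}

static bool is_ascii_letter(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static bool is_ascii_digit(unsigned char c) {
    return c >= '0' && c <= '9';
}

/* Sketch of a relaxed name check (assumption, not the library's API).
 * allow_unicode would be set by the caller for DTA 118 and later.
 * Multi-byte characters are skipped rather than verified to be letters,
 * so this is a partial check, as the commit message says. */
bool name_is_valid(const char *name, bool allow_unicode) {
    assert(name != NULL);
    const unsigned char *p = (const unsigned char *)name;

    /* An empty name is never valid. */
    if (p[0] == '\0')
        return false;

    /* First character: underscore, ASCII letter, or (if allowed) multi-byte. */
    if (!(p[0] == '_' || is_ascii_letter(p[0]) ||
          (allow_unicode && is_multibyte(p[0]))))
        return false;

    /* Remaining characters: additionally allow ASCII digits. */
    for (size_t j = 1; p[j] != '\0'; j++) {
        bool ok = p[j] == '_' || is_ascii_letter(p[j]) || is_ascii_digit(p[j]) ||
                  (allow_unicode && is_multibyte(p[j]));
        if (!ok)
            return false;
    }
    return true;
}
```

With this sketch, `name_is_valid("income_2018", false)` is accepted, a UTF-8 name such as `"\xe6\x94\xb6\xe5\x85\xa5"` is rejected when `allow_unicode` is false and accepted when it is true, and `"bad name"` is rejected either way because the space is an ASCII byte that still gets validated. Checking that the multi-byte characters really belong to the Letter class would need a Unicode library, which is why this remains a partial check.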


@hadley hadley closed this in b720b51 Jun 20, 2018

@lock lock bot commented Dec 17, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue.

@lock lock bot locked and limited conversation to collaborators Dec 17, 2018