write_dta not working for data.frame which columns contain unicode characters #383

Closed
withr opened this Issue Jun 18, 2018 · 3 comments

withr commented Jun 18, 2018

Starting with version 14, Stata supports Unicode letters in variable names, so legal column names can contain _, 0-9, and Unicode letters (not only Latin characters).

However, haven.R checks whether the names are valid with this code:

bad_names <- !grepl("^[A-Za-z_]{1}[A-Za-z0-9_]{0,31}$", names(data))

This is not correct. The check should take another parameter, version, and for version >= 14 it could use the following code instead:

bad_names <- !stringi::stri_detect_regex(names(data), "^[\\p{L}_]{1}[\\p{L}0-9_]{0,31}$")

However, validate_dta is not the only function that validates the column names.

The function 'dta_validate_name' in readstat_dta_write.c also checks the column names. I tried commenting out these lines:

int j;
for (j=0; name[j]; j++) {
    if (name[j] != '_' &&
            !(name[j] >= 'a' && name[j] <= 'z') &&
            !(name[j] >= 'A' && name[j] <= 'Z') &&
            !(name[j] >= '0' && name[j] <= '9')) {
        return READSTAT_ERROR_NAME_CONTAINS_ILLEGAL_CHARACTER;
    }
}
char first_char = name[0];
if (first_char != '_' &&
        !(first_char >= 'a' && first_char <= 'z') &&
        !(first_char >= 'A' && first_char <= 'Z')) {
    return READSTAT_ERROR_NAME_BEGINS_WITH_ILLEGAL_CHARACTER;
}

It seems to work, but I am not sure because of my limited experience in C.
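To illustrate why the byte-wise check above rejects Unicode names: in UTF-8, every non-ASCII character is encoded as bytes with values of 0x80 or above, so none of them fall in the ranges 'a'-'z', 'A'-'Z', or '0'-'9'. The sketch below is only an illustration; the helper names `is_ascii_name_byte` and `first_rejected_byte` are hypothetical and are not part of ReadStat.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the byte is one the ASCII-only check
 * accepts, i.e. '_', a-z, A-Z, or 0-9. */
static bool is_ascii_name_byte(unsigned char c) {
    return c == '_' ||
           (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Hypothetical helper: index of the first byte that the ASCII-only
 * check would reject, or -1 if every byte passes. */
static int first_rejected_byte(const char *name) {
    for (int j = 0; name[j]; j++) {
        if (!is_ascii_name_byte((unsigned char)name[j]))
            return j;
    }
    return -1;
}
```

For a pure ASCII name such as `income_2018` the helper returns -1. For a name that starts with a UTF-8 encoded non-ASCII letter, it returns 0, because the very first byte is already outside the accepted ranges.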

Member

hadley commented Jun 20, 2018

@evanmiller can you update the variable name check?

evanmiller added a commit to WizardMac/ReadStat that referenced this issue Jun 20, 2018

DTA 118 writer: Skip validation of multi-byte chars
Newer DTA allows Unicode characters of the Letter character class
to appear in column names. Proper validation will require some kind
of Unicode library, so in the meantime just skip the check for multi-
byte characters. (I.e. ASCII characters will continue to be validated)

See tidyverse/haven#383

Contributor

evanmiller commented Jun 20, 2018

It's not a full solution, but I've relaxed the code for DTA 118 and later.

WizardMac/ReadStat@4000645
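The relaxation described in the commit message (skip multi-byte characters, keep validating ASCII ones) could look roughly like the sketch below. This is not the actual ReadStat code: the function name `validate_name_relaxed`, the `name_status_t` codes, and the `allow_unicode` flag (standing in for "DTA 118 and later") are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; ReadStat has its own error enum. */
typedef enum {
    NAME_OK = 0,
    NAME_CONTAINS_ILLEGAL_CHARACTER,
    NAME_BEGINS_WITH_ILLEGAL_CHARACTER
} name_status_t;

/* In UTF-8, every byte of a multi-byte character has its high bit set. */
static bool is_multibyte_byte(unsigned char c) {
    return c >= 0x80;
}

static bool is_ascii_letter(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* allow_unicode is an assumed flag meaning "the target format permits
 * Unicode letters in names". */
name_status_t validate_name_relaxed(const char *name, bool allow_unicode) {
    for (size_t j = 0; name[j]; j++) {
        unsigned char c = (unsigned char)name[j];
        if (allow_unicode && is_multibyte_byte(c))
            continue; /* not validated: would need a Unicode library */
        if (c != '_' && !is_ascii_letter(c) && !(c >= '0' && c <= '9'))
            return NAME_CONTAINS_ILLEGAL_CHARACTER;
    }
    unsigned char first = (unsigned char)name[0];
    if (!(allow_unicode && is_multibyte_byte(first)) &&
        first != '_' && !is_ascii_letter(first))
        return NAME_BEGINS_WITH_ILLEGAL_CHARACTER;
    return NAME_OK;
}
```

As the commit message notes, this is a partial check: with `allow_unicode` set, any multi-byte sequence passes, including non-letters and malformed UTF-8. ASCII rules still apply, so a name with a space or a leading digit is rejected either way.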

hadley closed this in b720b51 Jun 20, 2018

lock bot commented Dec 17, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock bot locked and limited conversation to collaborators Dec 17, 2018