Categorical variables for dta files #19

Open
Nosferican opened this issue May 27, 2019 · 2 comments

@Nosferican

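using ReadStat
# Create the target directory first so that download does not fail on a missing path.
mkpath("data")
file = download("http://www.stata-press.com/data/r15/fullauto.dta",
                "data/ologit.dta")
data = read_dta(file)  # raw ReadStat result, including the value-label metadata
using StatFiles, DataFrames
output = load(file) |> DataFrame  # StatFiles result: values rather than labels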

If you inspect data, you will see that categorical variables carry a mapping from values to labels, given by val_labels_keys and val_label_dict. Because the default behavior specified here ignores that mapping, output contains the underlying values instead of the labels (e.g., rep77 gives [1, 2, 3, 4, 5] instead of ["Poor", "Fair", "Average", "Good", "Excellent"]). Other file formats may be affected as well, but I have only confirmed this for Stata's dta.
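As a rough sketch of the mapping described above, the snippet below applies a value-to-label dictionary to a vector of codes using only Base Julia. The names `codes`, `value_labels`, and `labels` and the sample codes are hypothetical stand-ins, not the fields ReadStat actually returns; only the rep77 labels come from the example above.

```julia
# Hypothetical stand-ins for one labelled column; not read from the .dta file.
codes = [3, 1, 5, 2, 4, 3]
value_labels = Dict(1 => "Poor", 2 => "Fair", 3 => "Average",
                    4 => "Good", 5 => "Excellent")

# Look up each code; fall back to the code's string form when no label is defined.
labels = [get(value_labels, c, string(c)) for c in codes]
```

The fallback is one possible choice for unlabelled codes; a real implementation would also need to decide how to treat missing values.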

@nalimilan

Do you think it would be appropriate to return a CategoricalArray for such cases? Something that I've been wondering recently is whether CategoricalArrays should be able to preserve the original value code in addition to the label. In R the fact that you can represent those either as factors or as labelled numeric vectors creates a divide which isn't optimal IMO.

@nalimilan

FWIW ReadStatTables.jl supports value labels via a special LabeledArray type. There's currently no way to convert these to CategoricalArray while preserving the ordering of levels, though.
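To illustrate what preserving the ordering and the original codes could mean, here is a minimal Base Julia sketch. It assumes the value labels are available as a plain dictionary; `value_labels`, `ordered_levels`, and `code_of` are hypothetical names, and this is not how ReadStatTables.jl or CategoricalArrays.jl work internally.

```julia
# Hypothetical value labels, deliberately written out of order.
value_labels = Dict(3 => "Average", 1 => "Poor", 5 => "Excellent",
                    2 => "Fair", 4 => "Good")

# Sort the codes, then read off the labels: the order follows the codes, not the alphabet.
ordered_codes = sort(collect(keys(value_labels)))
ordered_levels = [value_labels[c] for c in ordered_codes]

# Reverse lookup, so the original code can be recovered from a label.
code_of = Dict(label => code for (code, label) in value_labels)
```

A list like `ordered_levels` could then presumably be supplied as the level order of a categorical type, with `code_of` kept alongside it to recover the original codes.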
