Support multiple missing values #170

Closed
hadley opened this issue May 31, 2016 · 4 comments

@hadley
Member

hadley commented May 31, 2016

SAS and Stata support "tagged" missing values: .A, .B, ..., .Z, plus ._ in SAS. SPSS supports per-column user-defined missing values: either up to 3 distinct values, or a range plus 1 distinct value. haven needs to capture these special missing values in a way that preserves the semantics of regular R missing values, while enabling them to be labelled.

Replaces #33 & #118

hadley added a commit that referenced this issue May 31, 2016
Fixes #91. This is a quick fix for now - will need more work to complete #170.
@hadley
Member Author

hadley commented May 31, 2016

I think this might need two approaches: one for SAS/Stata and one for SPSS.

SAS/Stata

One way to handle tagged missing values (as suggested by @tslumley) would be to use the payload of NaNs. An IEEE 754 NaN has its exponent field (the eleven bits after the sign bit) filled with ones. R sets the final 32 bits of NA to 1954, which leaves 20 bits we could use to store extra information while keeping the value both a valid NaN and a valid NA. It's probably easiest to work with the 2nd byte of the double, treating it like a char (so tagged NA "A" would be .
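
To make the bit counting concrete, here is a minimal C sketch that splits a double into the fields mentioned above: sign, the eleven exponent bits, the 20 spare mantissa bits, and the final 32 bits. The names `double_fields`, `make_r_na` and `split_double` are hypothetical, and the NA pattern is built by hand from the description in this comment rather than taken from R's headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 double, split the way the comment describes. */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits; 0x7FF for every NaN and infinity */
    uint32_t high_mant;  /* top 20 bits of the 52-bit mantissa (the spare bits) */
    uint32_t low_word;   /* final 32 bits; 1954 in R's NA */
} double_fields;

/* Assumption: R's NA is "exponent all ones, low word 1954", as stated above. */
static double make_r_na(void) {
    uint64_t bits = ((uint64_t)0x7FF << 52) | 1954u;
    double d;
    memcpy(&d, &bits, sizeof d);  /* copy the bit pattern without converting it */
    return d;
}

static double_fields split_double(double d) {
    uint64_t bits;
    double_fields f;
    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);
    f.sign      = (unsigned)(bits >> 63);
    f.exponent  = (unsigned)((bits >> 52) & 0x7FF);
    f.high_mant = (uint32_t)((bits >> 32) & 0xFFFFF);
    f.low_word  = (uint32_t)(bits & 0xFFFFFFFFu);
    return f;
}
```

Splitting `make_r_na()` this way gives an exponent of 0x7FF and a low word of 1954; any information we add would have to live in `high_mant`.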

Advantages: tagged missing values are treated like regular missing values (or possibly NaNs)

Disadvantages: default print methods won't show the difference; we will need to write C code to access the values, re-label them, and format them to show the tag.

This will also need some API for getting/setting tagged NAs, and probably a helper for relabelling. Maybe:

is_tagged_na(x)
is_tagged_na(x, "A")
relabel_na(x, D = "did not respond", "N" = "not applicable")

SPSS

SPSS supports flagging ranges of values as missing, which means it will require another approach. To be consistent with SAS/Stata, it seems reasonable that these missing values should be given the value NA (so by default they are treated correctly by R), but an extra attribute could store the original value so it could be retrieved if desired.

It seems reasonable to subclass labelled to make labelled_spss, which would have an extra attribute called "nonmissing" (?) that records the original values of user-missing values.
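
A rough C sketch of the SPSS idea, under several assumptions: the struct `spss_missing` and both functions are hypothetical names, the range bounds are treated as inclusive, and a plain NaN stands in for R's NA. It only illustrates "replace the user-missing value, keep the original on the side", not how labelled_spss would actually store it.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-column rules: up to 3 distinct values,
   or a range plus 1 distinct value. */
typedef struct {
    size_t n_discrete;   /* 0..3 */
    double discrete[3];
    bool   has_range;
    double lo, hi;       /* assumed inclusive; used only when has_range is true */
} spss_missing;

static bool is_user_missing(const spss_missing *m, double x) {
    assert(m->n_discrete <= 3);
    if (m->has_range && x >= m->lo && x <= m->hi) return true;
    for (size_t i = 0; i < m->n_discrete; i++)
        if (x == m->discrete[i]) return true;
    return false;
}

/* Blank out user-missing entries of `values` and record what they were in
   `original` (same length). Returns how many entries were replaced. */
static size_t blank_user_missing(const spss_missing *m, double *values,
                                 double *original, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_user_missing(m, values[i])) {
            original[i] = values[i];  /* keep the SPSS value retrievable */
            values[i] = NAN;          /* stand-in for R's NA in this sketch */
            count++;
        } else {
            original[i] = NAN;        /* nothing to recover for this slot */
        }
    }
    return count;
}
```

With a rule such as "97 to 99, plus -9", the values 98 and -9 would be blanked in `values` and remain recoverable from `original`, which is the role the extra attribute would play.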

@tslumley

The first of the 20 bits distinguishes quiet from signalling NaNs, so we need to be careful not to touch it.

@hadley
Member Author

hadley commented Jun 6, 2016

@tslumley oops, yes, we should use the 3rd or 4th bytes, not the 2nd.
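
For illustration, a minimal C sketch of tagging via the 4th byte (counting from the sign byte), which keeps clear of the quiet/signalling bit. It works on the 64-bit pattern with shifts so it does not depend on byte order. `make_tagged_na` and `get_na_tag` are hypothetical names, `is_tagged_na` just mirrors the R-level idea above, and the NA pattern (exponent all ones, low word 1954) is assumed from the earlier comment rather than read from R's headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NA_LOW_WORD 1954u                    /* low 32 bits of R's NA, per the thread */
#define EXP_MASK    ((uint64_t)0x7FF << 52)  /* all eleven exponent bits */
#define QUIET_BIT   ((uint64_t)1 << 51)      /* first mantissa bit; lives in the 2nd byte */
#define TAG_SHIFT   32                       /* 4th byte, counting from the sign byte */

static uint64_t to_bits(double d) {
    uint64_t b;
    assert(sizeof d == sizeof b);
    memcpy(&b, &d, sizeof b);
    return b;
}

static double from_bits(uint64_t b) {
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Hypothetical helper: an NA whose 4th byte carries `tag`. */
static double make_tagged_na(char tag) {
    uint64_t bits = EXP_MASK | NA_LOW_WORD;
    bits |= (uint64_t)(unsigned char)tag << TAG_SHIFT;
    return from_bits(bits);
}

/* Hypothetical helper: the tag, or '\0' if x does not look like an NA. */
static char get_na_tag(double x) {
    uint64_t bits = to_bits(x);
    if ((bits & EXP_MASK) != EXP_MASK) return '\0';  /* not a NaN */
    if ((uint32_t)bits != NA_LOW_WORD) return '\0';  /* NaN or Inf, but not NA */
    return (char)((bits >> TAG_SHIFT) & 0xFF);
}

static int is_tagged_na(double x, char tag) {
    return tag != '\0' && get_na_tag(x) == tag;
}
```

Because the tag is written 19 bits below `QUIET_BIT`, two NAs with different tags agree on that bit, and the value still compares unequal to itself like any other NaN.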

@hadley
Member Author

hadley commented Jul 29, 2016

Now done.

@hadley hadley closed this as completed Jul 29, 2016
@lock lock bot locked and limited conversation to collaborators Jun 27, 2018