I'd like to see a function that turns a ragged array into a sparse one. The typical case is a "factor" with non-mutually exclusive choices that has been recorded using a group of drop-downs.
For example, suppose such a "factor" with legal values `A`/`B`/`C`/`D` is recorded over three variables `col1`, `col2` and `col3`:

| id | col1 | col2 | col3 |
|----|------|------|------|
| 1 | A | B | C |
| 2 | B | C | NA |
| 3 | D | NA | NA |
| 4 | B | D | NA |

Calling such a function, indicating that `col1`, `col2` and `col3` encode the same information, would yield:

| id | A | B | C | D |
|----|---|---|---|---|
| 1 | T | T | T | F |
| 2 | F | T | T | F |
| 3 | F | F | F | T |
| 4 | F | T | F | T |

Options would include the ability to set a prefix for the new variable names to avoid collisions, and to create the `NA` column.
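
To make the request concrete, here is a minimal base R sketch of what such a function might look like. The name `binarize`, its arguments `cols`, `prefix` and `na_column`, and the meaning given to the `NA` column (TRUE when every source column is missing) are all assumptions for illustration. This is not the implementation from the PR.

```r
# Hypothetical helper: name, arguments and NA-column semantics are assumptions.
binarize <- function(data, cols, prefix = "", na_column = FALSE) {
  # Character matrix with one row per record and one column per source variable
  values <- as.matrix(data[cols])
  # Observed levels, ignoring missing entries
  lvls <- sort(unique(values[!is.na(values)]))

  # Keep every column that is not part of the encoded group
  out <- data[setdiff(names(data), cols)]

  # One logical column per level: TRUE when any source column holds that level
  for (lv in lvls) {
    out[[paste0(prefix, lv)]] <- unname(rowSums(values == lv, na.rm = TRUE) > 0)
  }

  # Assumed semantics: the NA column flags records with no value at all
  if (na_column) {
    out[[paste0(prefix, "NA")]] <- unname(rowSums(!is.na(values)) == 0)
  }
  out
}

survey <- data.frame(
  id   = 1:4,
  col1 = c("A", "B", "D", "B"),
  col2 = c("B", "C", NA, "D"),
  col3 = c("C", NA, NA, NA),
  stringsAsFactors = FALSE
)

result   <- binarize(survey, c("col1", "col2", "col3"))
prefixed <- binarize(survey, c("col1", "col2", "col3"), prefix = "hist_")
```

With `prefix = "hist_"` the new columns would be named `hist_A`, `hist_B`, and so on, which avoids clashing with an existing column called `A`.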
I have run into this use case many times in medical surveys, where disease history is badly recorded using multiple drop-down lists or sets of checkboxes. IIRC, Google surveys also treat sets of checkboxes this way, with one column containing semicolon-separated values. That case can be handled with a call to `separate` followed by a call to `binarize`.
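
As a rough illustration of the first step, the sketch below uses `tidyr::separate` to split a semicolon-separated checkbox column into the ragged `col1`/`col2`/`col3` layout from the example. The data frame `responses`, the column name `history` and the choice of three target columns are assumptions. In real data the maximum number of choices per record would have to be known in advance.

```r
library(tidyr)

# Assumed input: one column holding semicolon-separated choices
responses <- data.frame(
  id      = 1:4,
  history = c("A;B;C", "B;C", "D", "B;D"),
  stringsAsFactors = FALSE
)

# fill = "right" pads records that have fewer than three choices with NA
ragged <- separate(
  responses, history,
  into = c("col1", "col2", "col3"),
  sep  = ";",
  fill = "right"
)
```

The result has the same shape as the first table, so it could then be passed to the proposed function.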
Playing around a bit with `spread` and `gather` produces this behavior, but it can be CPU/memory heavy on large data frames.
There is a (pre-tidyeval) implementation in PR #288.
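
For reference, one way the `gather`/`spread` workaround might be written is sketched below. The intermediate column names `source`, `level` and `present` are arbitrary choices. The long intermediate table holds one row per record and source column, which is where the extra memory cost comes from.

```r
library(tidyr)

survey <- data.frame(
  id   = 1:4,
  col1 = c("A", "B", "D", "B"),
  col2 = c("B", "C", NA, "D"),
  col3 = c("C", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Stack the three columns into one, dropping the missing entries
long <- gather(survey, key = "source", value = "level",
               col1, col2, col3, na.rm = TRUE)

# The originating column is irrelevant; duplicates would break spread()
long$source <- NULL
long <- unique(long)
long$present <- TRUE

# One column per level; combinations that never occur are filled with FALSE
wide <- spread(long, key = level, value = present, fill = FALSE)
```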