missingSummary Function (Potential Problem with exportRecords?) #5
Comments
EDIT: Same result with the missingSummary_offline function. |
This is an interesting case. Essentially what this says is that the branching logic parsing in the library does not match what's going on in REDCap. If you have the time, please outline the branching logic in REDCap and I'll see if I can reproduce. |
Happy to attach some example data and the data dictionary. It's strange because I'm using the API pull for both the data and the dictionary in the initial call. I'll update this in an hour or so with the example data. |
Has anything changed in how the branching logic is written to the metadata in the past...I don't know how long it's been since this was written. I would think there's a nontrivial probability that |
Oh, interesting. I have no clue. Here's the data and dictionary. For the data, I've given just one case, but across a few different forms. I caught the missingness error on the branching logic for variables "ser_amino_acids" and "amino_acids_res", the latter of which depends on the former being equal to "1". In the example data, ser_amino_acids == 1, so "amino_acids_res" was shown, but the response is blank. In theory, missingSummary should catch this, but it doesn't. It catches other missingness, but not this one. |
Have you done anything to this metadata file? I ask because the column order in the file you attached is nothing like the column order the function is assuming. |
Ah, the metafile I gave you was from a different library (me trying to circumvent the issues here while waiting to hear back). Here's the actual data dictionary, which I pulled using the following code in R: meta_data <- exportMetaData(rcon) |
Hi all -- just following up on this. Any ideas? |
It doesn't look like processing the branching logic has changed; I think I can rule that out. I'm afraid I haven't made much progress on this otherwise. I can't get missingSummary_offline to run without encountering an error. I'm running into problems where the code can't find variables it expects to find. As far as I can tell, any variable name that would exceed 32 characters is getting truncated. So "first_tx_other_medications_tx___99" is showing up as "first_tx_other_medications_tx___" in the data set, which is a problem because that is then the same variable name for every one of those checkbox options. The form complete variables are also kicking up trouble. For instance, "seizuresepilepsy_history_complete" is appearing in the data set as "seizuresepilepsy_history_complet". Again, exactly 32 characters (which I think is a SAS or SPSS limit). I can't say if that has anything to do with the failure to catch certain missingness (I doubt it), but it is making it difficult to replicate the finding. |
Yeah -- I feel you there! The 32-character limit holds for Stata too. I ended up needing to recode variables on export via Stata and R to remove characters from the middle, preserving the beginning and ending characters. I inherited the database, so I'm in the trenches banging my head against a wall right along with you. :) Here's that "import" code in R (ignore the "///"; I ran the code using RCALL in Stata):
library(devtools); ///
source("$Code/501c370418abb58bee78-4a0546caee0928f9ad91ad850cc751eb2d5d3675/missingSummary.R"); ///
library(redcapAPI); ///
library(stringr); ///
library(tidyr); ///
library(writexl); ///
options(redcap_api_url = "https://XXX"); ///
rcon <- redcapConnection(token = "XXX"); ///
records <- exportRecords(rcon, factors=FALSE, labels=FALSE, ///
                         dates=FALSE, survey=FALSE, dag=TRUE, ///
                         batch.size=-1); ///
max_length <- 30; ///
records_shortened <- records; ///
for (col in colnames(records)) {; ///
  if (nchar(col) >= max_length) {; ///
    if (grepl("medication", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("medication", "med", col, ignore.case = TRUE); ///
    } else if (grepl("seizure", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("seizure", "seiz", col, ignore.case = TRUE); ///
    } else if (grepl("history", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("history", "hx", col, ignore.case = TRUE); ///
    } else if (grepl("abnormalities", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("abnormalities", "abnorm", col, ignore.case = TRUE); ///
    } else if (grepl("malforma", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("malforma", "mal", col, ignore.case = TRUE); ///
    } else if (grepl("etiology", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("etiology", "eti", col, ignore.case = TRUE); ///
    } else if (grepl("structural", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("structural", "str", col, ignore.case = TRUE); ///
    } else if (grepl("second", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("second", "sec", col, ignore.case = TRUE); ///
    } else if (grepl("classifica", col, ignore.case = TRUE)) {; ///
      new_col <- gsub("classifica", "class", col, ignore.case = TRUE); ///
    } else {; ///
      new_col <- substring(col, 1, max_length); ///
    }; ///
    colnames(records_shortened)[colnames(records_shortened) == col] <- new_col; ///
  }; ///
}; ///
write_xlsx(records_shortened,"XXX/REDcap Test.xlsx");
|
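For anyone following along, the truncation problem and the "drop characters from the middle" workaround described above can be sketched in base R. This is only an illustration: `shorten_middle` and its `tail_chars` argument are hypothetical names, not part of redcapAPI, and the two field names are the ones quoted earlier in the thread.

```r
# Hypothetical helper (not part of redcapAPI): shorten names longer than
# `max_length` by keeping the first characters and the last `tail_chars`
# characters, dropping whatever sits in the middle.
shorten_middle <- function(x, max_length = 32, tail_chars = 8) {
  too_long   <- nchar(x) > max_length
  head_chars <- max_length - tail_chars
  n          <- nchar(x[too_long])
  x[too_long] <- paste0(
    substr(x[too_long], 1, head_chars),
    substr(x[too_long], n - tail_chars + 1, n)
  )
  x
}

field_names <- c("first_tx_other_medications_tx___1",
                 "first_tx_other_medications_tx___99",
                 "seizuresepilepsy_history_complete",
                 "record_id")

# Plain truncation at 32 characters collapses the checkbox columns
# into a single name.
truncated <- substr(field_names, 1, 32)
has_collision <- anyDuplicated(truncated) > 0

# Keeping the tail preserves the checkbox code, so names stay distinct.
shortened <- shorten_middle(field_names)
```

Whatever shortening rule is used, checking `anyDuplicated()` on the result before writing the file is a cheap guard against two columns silently sharing one name.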
What is the simplest reproducible example of this behavior? |
Alrighty. Simplest approach: download these two files. Then run the following R code:
source("[Directory Path Here]/missingSummary.R")
library(redcapAPI)
Miss <- missingSummary_offline("[Directory Path Here]/redcap_data.csv","[Directory Path Here]/data_dictionary.csv")
Miss
The output will be this:
  record_id redcap_repeat_instrument n_missing      missing
1    123456                                  0
2    123456                      eeg         1  based_score
3    123456                      eeg         1  based_score
4    123456                  imaging         1 mri_uploaded
However, I know it's not catching everything: from spot-checking, I know that when "ser_amino_acids" equals 1, a response is expected in "amino_acids_res". In the data, amino_acids_res == "" and should therefore show up in the Miss report, but it isn't there. The data dictionary line for "amino_acids_res" is:
amino_acids_res | genetics_metabolics | radio | Amino acids | 1, Normal \| 2, Abnormal \| 3, Uncertain [ser_amino_acids] = '1' | | | metab_results |
Anything else I can clarify? |
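The manual spot check described above can be reproduced with a few lines of base R, independent of missingSummary. This is a minimal sketch on made-up rows; only the two field names come from the thread, and `is_blank` is a hypothetical helper.

```r
# Toy records; the values are invented for illustration.
records <- data.frame(
  record_id       = c("1", "2", "3"),
  ser_amino_acids = c("1", "1", "0"),
  amino_acids_res = c("2", "", NA),
  stringsAsFactors = FALSE
)

# Treat both NA and empty strings as "no response", since a CSV import
# can produce either depending on the column type.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# The field is shown only when the gatekeeper equals "1".
shown   <- records$ser_amino_acids %in% "1"
flagged <- records$record_id[shown & is_blank(records$amino_acids_res)]
flagged
```

Record 3 is blank too, but its gatekeeper is not "1", so it is not flagged. Any record returned here that is absent from the missingSummary report points at the discrepancy.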
This is going to be long, but I'm not sure how to describe what's happening very succinctly. TLDR: the function has a pretty serious flaw. I made a data set with the following when I export it from REDCap.
What we want to check is whether BUT (see all that formatting? this must be important),
Now when we compare The offending part of the code is https://gist.github.com/nutterb/501c370418abb58bee78#file-missingsummary-r-L76-L99, where we are overwriting
I'll post a correction in a few minutes. |
there's a correction up. @spgarbet, if you'd like, I can work up a cleaner version of this, since you've expressed interest in including it in the package. |
Wow, you're awesome. And your explanation makes perfect sense. Thank you for all the help. I'll give this a shot and let you know what comes back. |
Getting closer. Here's the output now:
record_id redcap_repeat_instrument n_missing missing
1 123456 NA NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, outcome_2y_episurg_yn, NA, NA, amino_acids_res, acylcarnitines_res, NA, NA
2 123456 eeg 1 based_score
3 123456 eeg 1 based_score
4 123456 imaging 1 mri_uploaded
Any idea why the NA is being spit out? Is that due to checkbox values (i.e. race = race___1, race___2, etc.)? |
I don't think so. I just ran it against a database with the full data dictionary. I'm not getting that. I also just posted another update, so maybe I inadvertently fixed something, but I doubt it. I'm wondering if the lack of a value in |
Thanks -- I'll give it a shot. |
New error:
Error in .missingSummary_isMissingInField(records_orig, meta_data, logic) : object 'records_orig' not found
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on '/Users/xxxx/Downloads/redcap_data.csv'
Called from: .missingSummary_isMissingInField(records_orig, meta_data, logic)
Seems to stem from L150-152. |
I'd love to include it in the package. We'd need a good test case for regression testing on updates. |
@mcfibb, Ah, missed declaring |
IT WORKS! @nutterb -- thank you so, so much. |
723rd time is a charm! |
Looks like we'll need to make that 724 times. Something isn't right: in crosschecking missingness, non-missing values are now being counted as missing. To replicate, download the following files, then run missingSummary_offline. You'll see there are 121 missing items, including "amino_acids_res". While that value is indeed blank, the "gatekeeper" variable "ser_amino_acids" is NOT "1", and therefore "amino_acids_res" should not be considered missing. (Sidenote: I know we keep using these variables, but it's more of a convenience thing than anything else. It's just emblematic of the larger issue.) Ideas? |
I made the rather absurd assumption that the parsed logic list would have the same length as records. checkbox variables blow that assumption out of the water. I've updated the function to reference the parsed logic by field name. There are also some small improvements in counting the missing variables that avoid implicitly turning data frames into matrices. Seems to be working well on your data example and the very small data example I'm building out to test this more robustly. Let me know what blows up when you generalize it over your larger data. |
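To illustrate the fix described above (looking up parsed logic by field name instead of by position), here is a minimal sketch. It is not the code in the gist: `parsed_logic`, `logic_for_column`, and the toy data are hypothetical, and it assumes checkbox columns follow the `field___code` naming seen earlier in the thread.

```r
# Toy export: one checkbox field (race) expands into two columns, so the
# data has more columns than the dictionary has fields.
records <- data.frame(
  record_id       = c("1", "2"),
  race___1        = c("0", "1"),
  race___2        = c("0", "0"),
  ser_amino_acids = c("1", "0"),
  amino_acids_res = c("", ""),
  stringsAsFactors = FALSE
)

# Hypothetical parsed logic, keyed by dictionary field name. Fields
# without branching logic are simply absent from the list.
parsed_logic <- list(
  amino_acids_res = quote(ser_amino_acids == "1")
)

# Map a data column back to its dictionary field, then look up by name.
logic_for_column <- function(column, parsed_logic) {
  field <- sub("___.*$", "", column)  # race___1 -> race
  parsed_logic[[field]]               # NULL when the field has no logic
}

# For each column, decide row by row whether the field was shown.
shown <- lapply(names(records), function(column) {
  logic <- logic_for_column(column, parsed_logic)
  if (is.null(logic)) rep(TRUE, nrow(records)) else eval(logic, records)
})
names(shown) <- names(records)
```

Because the lookup goes through the field name, it no longer matters that the checkbox columns make the data wider than the dictionary.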
Thanks a ton! Will try out later today. :) |
Seems to work so far. Will keep you posted! |
Alright -- back again with another episode of "WTF: It's Not Just Me, Right?" Before I get into detail, here's the bigger question: If a form is marked "Incomplete", will missingness show up in this report or is that form excluded from the summary code? |
It should still show up. Does your project have events or repeated forms?
I discovered today that this can only handle records where all the
branching logic is on the same row. So if there are prereqs in earlier
events, we won't get the expected results.
|
It does (repeated forms), but it turns out that isn't the issue, and neither is the "Incomplete" thing. Checkboxes. Why did it have to be checkboxes... They are marked "0" by default, so of course they'd never show up as missing. So, I'm trying to figure out how to parse the file such that if all subvariables associated with a checkbox variable equal 0, the primary checkbox variable is included as "missing". I wrote some code on my own for that in Stata (which I use in tandem with R using Rcall), but I'm not quite sure how I'd combine it with the results of missingSummary. I'm guessing checkbox missingness isn't an included functionality at this point? |
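The "all subvariables equal 0" rule described above could be sketched in R roughly as follows. This is an assumption-laden illustration, not missingSummary functionality: `checkbox_all_unchecked` is a hypothetical helper, and it assumes the export uses raw 0/1 values (not "Checked"/"Unchecked" labels) and the `field___code` column naming.

```r
# Toy export with one checkbox field, race, expanded into three columns.
records <- data.frame(
  record_id = c("1", "2", "3"),
  race___1  = c(0, 1, 0),
  race___2  = c(0, 0, 0),
  race___99 = c(0, 0, 1)
)

# Hypothetical helper: TRUE for rows where no option of `field` is checked.
checkbox_all_unchecked <- function(records, field) {
  cols <- grep(paste0("^", field, "___"), names(records), value = TRUE)
  stopifnot(length(cols) > 0)
  rowSums(records[cols] == 1, na.rm = TRUE) == 0
}

race_unchecked <- checkbox_all_unchecked(records, "race")
records$record_id[race_unchecked]
```

Whether an entirely unchecked group really means "missing" still depends on the instrument, so this is best treated as an opt-in check per field.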
I haven't ever considered checkbox missingness, mostly because whether all of them being unchecked is an indication of 'missing' is definitely situational. I'm inclined to think this would be easier to extract with a calculated field (see below for an example).
I'm not familiar with Stata enough to talk about joins, but if you have your missingness value tied to the record id (and event/form IDs), a left join would be the approach I would take. |
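For the R side of the left join suggested above, base `merge()` with `all.x = TRUE` is one option. A minimal sketch, assuming the summary has the columns shown in the output earlier in the thread; the `checkbox_missing` table and its column are hypothetical stand-ins for a separately computed checkbox result.

```r
# Shaped like the missingSummary output shown earlier in the thread.
missing_summary <- data.frame(
  record_id                = c("1", "2"),
  redcap_repeat_instrument = c("", "eeg"),
  n_missing                = c(0, 1),
  missing                  = c("", "based_score"),
  stringsAsFactors = FALSE
)

# Hypothetical result of a separate checkbox check, keyed the same way.
checkbox_missing <- data.frame(
  record_id                = "1",
  redcap_repeat_instrument = "",
  checkbox_missing         = "race",
  stringsAsFactors = FALSE
)

# Left join: keep every summary row, attach checkbox results where the
# keys match, and leave NA elsewhere.
combined <- merge(
  missing_summary, checkbox_missing,
  by    = c("record_id", "redcap_repeat_instrument"),
  all.x = TRUE
)
combined
```

In a longitudinal project, the event column would presumably need to be added to `by` as well, so rows from different events are not matched to each other.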
This code is now published on the main branch as part of the upcoming 2.4.2 release. |
@mcfibb would you reach out to me at my email < shawn.garbett (at) vumc.org > ? |
Hi everyone -- long time, no chat! So, I'm back to my old REDcap data quality pulls, and I'm making good use of the missingSummary feature. I haven't updated the library because I don't want to break anything that I programmed a year ago, but I am running into some issues from where I left off vis a vis checkbox missingness. I did indeed write my own "side-chained" code to merge missingSummary data with my own checkbox missingness code, but I made the (foolish) assumption in my own code that parsing logic wouldn't apply to checkboxes... which of course is not the case. (A visible checklist may trigger an invisible checklist). Turns out, this happens a lot in my data, so I've been producing quality reports that show missingness for checkbox data (i.e. all 0's) when in fact they've never been shown at all, so it would make sense they all have zeros. But my workaround hasn't considered all of the parsing logic for each checkbox. So, with that long-winded context out of the way: has there been any progress on identifying checkbox missingness using missingSummary? (Based on the following logic: 1) the checkbox field passes the logic and was shown, and 2) all values of the checkbox field sum to 0 (e.g. nothing was checked)) Any help would be appreciated! |
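The two-part rule proposed above (the checkbox field was shown, and nothing in it was checked) could be prototyped outside missingSummary along these lines. Everything here is an assumption for illustration: the field names `gate_yn` and `symptoms` are invented, and the string translation handles only simple `[field] = 'value'` logic, not the full REDCap branching syntax.

```r
# Toy data: symptoms is a checkbox field that is only shown when the
# hypothetical gatekeeper gate_yn equals "1".
records <- data.frame(
  record_id    = c("1", "2", "3"),
  gate_yn      = c("1", "1", "0"),
  symptoms___1 = c(0, 1, 0),
  symptoms___2 = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Branching logic as it would appear in the data dictionary.
branching_logic <- "[gate_yn] = '1'"

# Naive translation to an R expression: strip the brackets and turn a
# lone "=" into "==". Real logic (and/or, events, checkbox syntax)
# needs a proper parser.
r_expr <- gsub("\\[([a-z0-9_]+)\\]", "\\1", branching_logic)
r_expr <- gsub("(?<![<>!=])=(?!=)", "==", r_expr, perl = TRUE)

# Condition 1: the field was shown (NA counts as not shown).
shown <- eval(parse(text = r_expr), envir = records) %in% TRUE

# Condition 2: no option was checked.
cols            <- grep("^symptoms___", names(records), value = TRUE)
nothing_checked <- rowSums(records[cols] == 1, na.rm = TRUE) == 0

checkbox_missing_ids <- records$record_id[shown & nothing_checked]
checkbox_missing_ids
```

Record 3 also has nothing checked, but the field was never shown, so only record 1 is reported.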
We've addressed a lot of issues recently around checkboxes but not that one. missingSummary has been revised a fair bit as well and there is a pull request awaiting. One issue with the design of a checkbox is that by definition they always have a value. I.e. by their nature and definition they are never missing (this causes weird issues with detecting emptyRows). Your proposal is for a group of checkboxes that they are considered missing or empty when the entire group is unchecked. I think that could be accomplished but it would be a choice on the part of the user, i.e. it's an external definition of missing that is not natural to the domain. That said, I can see the usefulness of such a proposal. I am reaching trying to think how it could be defined via branch logic or smart variables. I'll open a new ticket for that discussion. |
Thanks Shawn — and totally agree. In the clinical research I touch
leveraging REDcap, this is definitely a common enough issue where I’d be
more than happy to work with you closely to find a fix. :)
|
Hi there -- so appreciate this library! Hoping this is the right place to ask this:
On first glance, the missingSummary ancillary function for redcapAPI works perfectly. However, when cross-checking true missingness with REDcap open, there are a number of branching logic values where conditions are met, missingness still occurs, but the code doesn't report the missing value. Is it possible this is related to one of the options in the exportRecords function being the wrong value or perhaps the function isn't pulling in all available data?
Here's a link to the missingSummary gist: https://gist.github.com/nutterb/501c370418abb58bee78