missingSummary Function (Potential Problem with exportRecords?) #5

mcfibb · 2023-02-08T20:44:21Z

Hi there -- so appreciate this library! Hoping this is the right place to ask this:

On first glance, the missingSummary ancillary function for redcapAPI works perfectly. However, when cross-checking true missingness with REDcap open, there are a number of branching logic values where conditions are met, missingness still occurs, but the code doesn't report the missing value. Is it possible this is related to one of the options in the exportRecords function being the wrong value or perhaps the function isn't pulling in all available data?

Here's a link to the missingSummary gist: https://gist.github.com/nutterb/501c370418abb58bee78

mcfibb · 2023-02-08T22:38:52Z

EDIT: Same result with the missingSummary_offline function.

spgarbet · 2023-02-23T15:05:22Z

This is an interesting case. Essentially what this says is that the branching logic parsing in the library does not match what's going on in REDCap. If you have the time, please outline the branching logic in REDCap and I'll see if I can reproduce.

mcfibb · 2023-02-23T16:26:50Z

Happy to send attach some example data + the data dictionary. It's strange because I'm using the API pull for both the data and the dictionary in the initial call.

I'll update this in an hour or so with the the example data.

nutterb · 2023-02-23T16:30:58Z

Has anything changed in how the branching logic is written to the metadata in the past...I don't know how long it's been since this was written. I would think there's a nontrivial probability that parseBranchingLogic is out of date.

mcfibb · 2023-02-23T16:54:40Z

Oh, interesting. I have no clue. Here's the data and dictionary.

For the data, I've given just one case, but across a few different forms. I caught the missingness error on the branching logic for variables "ser_amino" and "amino_acids_res", the latter of which has a dependency on the former being equal to "1". In the example data, "ser_amino==1", so "amino_acids_res" was shown, but there was a blank response. In theory, missingSummary should catch this, but it doesn't. It catches other missingness, but not this one.

meta.csv
REDcap API Export.csv

nutterb · 2023-02-23T18:27:59Z

Have you done anything to this metadata file?

I ask because the column order in the file you attached is nothing like the column order the function is assuming.

mcfibb · 2023-02-23T20:35:38Z

Ah, the metafile a gave you was from a different library (me trying to circumvent the issues here while waiting to hear back). Here's the actual data dictionary, which I pulled using the following code in R:

meta_data <- exportMetaData(rcon)
write_excel_csv(meta_data,"~Downloads/meta.csv")

meta.csv

mcfibb · 2023-02-27T19:46:34Z

Hi all -- just following up on this. Any ideas?

spgarbet · 2023-02-27T20:04:33Z

I'm hoping @nutterb has an idea. I have a test framework setup for issues like this (working Issue #6). If I can reproduce this and understand the root cause then we can turn it around.

nutterb · 2023-02-27T20:46:00Z

It doesn't look like processing the branching logic has changed; I think I can rule that out.

I'm afraid I haven't made much progress on this otherwise. I can't get missingSummary_offline to run without encountering an error. I'm running into problems where the code can't find variables it seems to want to find. As far as I can tell, any variable name that would exceed 32 characters is getting truncated. So "first_tx_other_medications_tx___99" is showing up at "first_tx_other_medications_tx___" in the data set. Which is a problem because it's the same variable name for every one of those checkbox options.

The form complete variables are also kicking up trouble. For instance ""seizuresepilepsy_history_complete" is appearing in the data set at ""seizuresepilepsy_history_complet". Again, exactly 32 characters (which I think is a SAS or SPSS limit). I can't say if that has anything to do with the failure to catch certain missingness (I doubt it), but it is making it difficult to replicate the finding.

mcfibb · 2023-02-27T20:57:14Z

Yeah -- I feel you there! The 32 character limit is true too for Stata. I ended up needing to recode variables on export via Stata & R to remove characters from the middle, preserving the beginning and ending characters. I inherited the database, so I'm in the trenches banging my head against a wall right along with you. :)

Here's that "import" code in R (ignore the "///". I ran the code using RCALL in Stata):

                                library(devtools);			///
				source("$Code/501c370418abb58bee78-4a0546caee0928f9ad91ad850cc751eb2d5d3675/missingSummary.R");	///
				library(redcapAPI);			///
				library(stringr);			///
				library(tidyr);				///
				library(writexl);	///
				options(redcap_api_url = "https://XXX);	///
				rcon <- redcapConnection(token = "XXX");		///
				records <- exportRecords(rcon, factors=FALSE, labels=FALSE,	///
                           dates=FALSE, survey=FALSE, dag=TRUE,	///
                           batch.size=-1);	///
				max_length <- 30;	///
				records_shortened <- records;	///
				for (col in colnames(records)) {;	///
				  if (nchar(col) >= max_length) {;	///
					if (grepl("medication", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("medication", "med", col, ignore.case = TRUE);	///
					} else if (grepl("seizure", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("seizure", "seiz", col, ignore.case = TRUE);	///
					} else if (grepl("history", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("history", "hx", col, ignore.case = TRUE);	///
					} else if (grepl("abnormalities", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("abnormalities", "abnorm", col, ignore.case = TRUE);	///
					} else if (grepl("malforma", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("malforma", "mal", col, ignore.case = TRUE);	///
					} else if (grepl("etiology", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("etiology", "eti", col, ignore.case = TRUE);	///
					} else if (grepl("structural", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("structural", "str", col, ignore.case = TRUE);	///
					} else if (grepl("second", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("seoncd", "sec", col, ignore.case = TRUE);	///
					} else if (grepl("classifica", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("classifica", "class", col, ignore.case = TRUE);	///
					} else {;	///
					  new_col <- substring(col, 1, max_length);	///
					};	///
					colnames(records_shortened)[colnames(records_shortened) == col] <- new_col;	///
				  };	///
				};	///
				write_xlsx(records_shortened,"XXX/REDcap Test.xlsx");

spgarbet · 2023-03-02T16:23:54Z

What is the simplest reproducible example of this behavior?

mcfibb · 2023-03-02T19:12:32Z

Alrighty. Simplest approach:

Download these two files:
data_dictionary.csv
redcap_data.csv

Then, run the following R code:

source("[Directory Path Here]/missingSummary.R")
library(redcapAPI)
Miss <- missingSummary_offline("[Directory Path Here]/redcap_data.csv","[Directory Path Here]/data_dictionary.csv")
Miss

The output will be this:

record_id redcap_repeat_instrument n_missing      missing
1    123456                              0             
2    123456                      eeg         1  based_score
3    123456                      eeg         1  based_score
4    123456                  imaging         1 mri_uploaded

However, I know it's not catching everything because when spot-checking, I know that when "ser_amino_acids" equals 1, then there should be a response expected in "amino_acids_res". In the data, amino_acids_res=="" and therefore should show up in the Miss report, but it isn't there. The data dictionary line for "amino_acids_res" is:

amino_acids_res | genetics_metabolics | radio | Amino acids | 1, Normal \| 2,   Abnormal \| 3, Uncertain [ser_amino_acids]   = '1' |   |   | metab_results |

Anything else I can clarify?

nutterb · 2023-03-02T19:29:11Z

This is going to be long, but I'm not sure how to describe what's happening very succinctly. TLDR: the function has a pretty serious flaw. I made a data set with the following when I export it from REDCap.

> records
  record_id demographics_complete ser_amino_acids amino_acids_res genetics_metabolics_complete
1         1                     0               1              NA                            0

What we want to check is whether amino_acids is missing when ser_amino_acids = 1. When we parse the branching logic, we parse the branching logic, we come up with expression(ser_amino_acids == "1"). This is correct.

BUT (see all that formatting? this must be important), missingSummary iterates through all of the columns from left to right to look for missing values, and replaces the values in records with a logical flag indicating if the value is missing. Seems harmless enough, but look at what records becomes after we check ser_amino_acids for missing values.

  record_id demographics_complete ser_amino_acids amino_acids_res genetics_metabolics_complete
1         1                 FALSE           FALSE              NA                            0

Now when we compare ser_amino_acids == 1, we're going to get a FALSE, and we're not actually going to look for missingness in amino_acids. The terrifying thing about this is that no variable downstream of branching logic is going to be checked properly. This has never worked as intended (not with branching logic, anyway).

The offending part of the code is https://gist.github.com/nutterb/501c370418abb58bee78#file-missingsummary-r-L76-L99, where we are overwriting records. I believe it should be (for lines 87-97)...this will also require making a copy of records (records_orig <- records immediately following the exportRecords call)

 if (tmp_form == "_complete") records[[i]] <- FALSE
      if (!tmp_form %in% names(records_orig))
        records[[i]] <- is.na(records_orig[[i]])
      else if (!is.expression(l))
        records[[i]] <- ifelse(is.na(records_orig[[tmp_form]]), 
                               FALSE, is.na(records_orig[[i]]))
      else
        records[[i]] <- ifelse(is.na(records_orig[[tmp_form]]),
                               FALSE,
                               ifelse(with(records_orig, eval(l)), is.na(records_orig[[i]]),
                                      FALSE))

I'll post a correction in a few minutes.

nutterb · 2023-03-02T19:43:27Z

there's a correction up. @spgarbet, if you'd like, I can work up a cleaner version of this, since you've expressed interest in including it in the package.

mcfibb · 2023-03-02T20:04:39Z

Wow, you're awesome. And your explanation makes perfect sense. Thank you for all the help. I'll give this a shot and let you know what comes back.

mcfibb · 2023-03-02T20:07:15Z

Getting closer: Here's the output now:

 record_id redcap_repeat_instrument n_missing                                                                                                                                        missing
1    123456                             NA NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, outcome_2y_episurg_yn, NA, NA, amino_acids_res, acylcarnitines_res, NA, NA
2    123456                      eeg         1                                                                                                                                    based_score
3    123456                      eeg         1                                                                                                                                    based_score
4    123456                  imaging         1                                                                                                                                   mri_uploaded

Any idea why the NA is being spit out? Is that due to checkboxed values? (i.e. race = race____1, race____2, etc)?

nutterb · 2023-03-02T21:24:19Z

I don't think so. I just ran it against a database with the full data dictionary. I'm not getting that. I also just posted another update, so maybe I inadvertently fixed something, but I doubt it.

I'm wondering if the lack of a value in redcap_repeat_instrument is related. I don't have repeating instruments set up on my project right now. Give the new posting a try and let's see how it goes.

mcfibb · 2023-03-02T21:45:03Z

Thanks -- I'll give it a shot.

mcfibb · 2023-03-02T21:46:33Z

New error:

Error in .missingSummary_isMissingInField(records_orig, meta_data, logic) : 
  object 'records_orig' not found
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on '/Users/xxxx/Downloads/redcap_data.csv'
Called from: .missingSummary_isMissingInField(records_orig, meta_data, logic)

Seems to stem from L150-152.

spgarbet · 2023-03-02T22:09:16Z

I'd love to include it in the package. We'd need a good test case for regression testing on updates.

nutterb · 2023-03-02T23:16:39Z

@mcfibb, Ah, missed declaring records_orig in the offline version. I'm guessing that's the one you used? I've properly tested that one now, too.

mcfibb · 2023-03-03T00:17:25Z

IT WORKS! @nutterb -- thank you so, so much.

nutterb · 2023-03-03T00:51:00Z

723rd time is a charm!

mcfibb · 2023-03-07T17:48:01Z

Looks like we'll need to make that 724 times. Something isn't right: in crosschecking missingness, now new non-missing are being counted as missing.

To replicate, download the following files:
data.csv
dictionary.csv

Then run missingSummary_offline. You'll see there's 121 missing items, including "amino_acids_res". While true, you'll see that the "gatekeeper" variable of "ser_amino_acirds" is NOT "1", and therefore, "amino_acids_res" should not be considered missing. (Sidenote: I know we keep using these variables, but it's more of a convenience thing than anything else. It's just emblematic of the larger issue.)

Ideas?

nutterb · 2023-03-07T21:50:08Z

I made the rather absurd assumption that the parsed logic list would have the same length as records. checkbox variables blow that assumption out of the water. I've updated the function to reference the parsed logic by field name. There are also some small improvements in counting the missing variables that avoid implicitly turning data frames into matrices.

Seems to be working well on your data example and the very small data example I'm building out to test this more robustly. Let me know what blows up when you generalize it over your larger data.

mcfibb · 2023-03-07T23:03:59Z

Thanks a ton! Will try out later today. :)

mcfibb · 2023-03-08T00:04:02Z

Seems to work so far. Will keep you posted!

mcfibb · 2023-03-08T23:46:34Z

Alright -- back again with another episode of "WTF: It's Not Just Me, Right?"

Before I get into detail, here's the bigger question: If a form is marked "Incomplete", will missingness show up in this report or is that form excluded from the summary code?

nutterb · 2023-03-09T00:55:10Z

It should still show up. Does your project have events or repeated forms. I discovered today that this can only handle records where all the branching logic is on the same row. So if there are prereqs in earlier events, we won't get the expected results.

…

On Wed, Mar 8, 2023, 6:46 PM mcfibb ***@***.***> wrote: Alright -- back again with another episode of "WTF: It's Not Just Me, Right?" Before I get into detail, here's the bigger question: If a form is marked "Incomplete", will missingness show up in this report or is that form excluded from the summary code? — Reply to this email directly, view it on GitHub <#5 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAODI7OELRBNG55D3DU3QA3W3EK6LANCNFSM6AAAAAAUVXKT3M> . You are receiving this because you were mentioned.Message ID: ***@***.***>

mcfibb · 2023-03-09T02:39:35Z

It does (repeated forms), but it turns out that isn't the issue, and neither is the "Incomplete" thing either.

Checkboxes. Why did it have to be checkboxes...

They are marked "0" at default, so of course they'd never show up as missing. So, I'm trying to figure out how to parse the file such that if all subvariables associated with that checkbox variable "=0", then include the checkbox primary variable as "missing".

I wrote some code on my own for that in Stata (which I use in tandem with R using Rcall), but I'm not quite sure how I'd combine it with the results of missingSummary. I'm guessing checkbox missingness isn't an included functionality at this point?

nutterb · 2023-03-09T12:56:35Z

I haven't ever considered checkbox missingness, mostly because whether all of them being unchecked is an indication of 'missing' is definitely situational. I'm inclined to think this would be easier to extract with a calculated field (see below for an example).

I wrote some code on my own for that in Stata (which I use in tandem with R using Rcall), but I'm not quite sure how I'd combine it with the results of missingSummary. I'm guessing checkbox missingness isn't an included functionality at this point?

I'm not familiar with Stata enough to talk about joins, but if you have your missingness value tied to the record id (and event/form IDs), a left join would be the approach I would take.

spgarbet · 2023-03-13T22:17:36Z

This code is now published on the main branch as part of the upcoming 2.4.2 release.

Remove stringr dependency

spgarbet · 2024-01-04T15:50:17Z

@mcfibb would you reach out to me at my email < shawn.garbett (at) vumc.org > ?

mcfibb · 2024-03-15T19:12:22Z

Hi everyone -- long time, no chat!

So, I'm back to my old REDcap data quality pulls, and I'm making good use of the missingSummary feature. I haven't updated the library because I don't want to break anything that I programmed a year ago, but I am running into some issues from where I left off vis a vis checkbox missingness.

I did indeed write my own "side-chained" code to merge missingSummary data with my own checkbox missingness code, but I made the (foolish) assumption in my own code that parsing logic wouldn't apply to checkboxes... which of course is not the case. (A visible checklist may trigger an invisible checklist). Turns out, this happens a lot in my data, so I've been producing quality reports that show missingness for checkbox data (i.e. all 0's) when in fact they've never been shown at all, so it would make sense they all have zeros. But my workaround hasn't considered all of the parsing logic for each checkbox.

So, with that long-winded context out of the way: has there been any progress on identifying checkbox missingness using missingSummary? (Based on the following logic: 1) the checkbox field passes the logic and was shown, and 2) all values of the checkbox field sum to 0 (e.g. nothing was checked))

Any help would be appreciated!

spgarbet · 2024-03-15T19:39:49Z

We've addressed a lot of issues recently around checkboxes but not that one. missingSummary has been revised a fair bit as well and there is a pull request awaiting.

One issue with the design of a checkbox is that by definition they always have a value. I.e. by their nature and definition they are never missing (this causes weird issues with detecting emptyRows). Your proposal is for a group of checkboxes that they are considered missing or empty when the entire group is unchecked. I think that could be accomplished but it would be a choice on the part of the user, i.e. it's an external definition of missing that is not natural to the domain. That said, I can see the usefulness of such a proposal. I am reaching trying to think how it could be defined via branch logic or smart variables. I'll open a new ticket for that discussion.

mcfibb · 2024-03-15T20:06:31Z

Thanks Shawn — and totally agree. In the clinical research I touch leveraging REDcap, this is definitely a common enough issue where I’d be more than happy to work with you closely to find a fix. :)

…

On Fri, Mar 15, 2024 at 12:40 PM Shawn Garbett ***@***.***> wrote: We've addressed a lot of issues recently around checkboxes but not that one. missingSummary has been revised a fair bit as well and there is a pull request awaiting. One issue with the design of a checkbox is that by definition they *always* have a value. I.e. by their nature and definition they are never missing (this causes weird issues with detecting emptyRows). Your proposal is for a group of checkboxes that they are considered missing or empty when the entire group is unchecked. I think that could be accomplished but it would be a choice on the part of the user, i.e. it's an external definition of missing that is not natural to the domain. That said, I can see the usefulness of such a proposal. I am reaching trying to think how it could be defined via branch logic or smart variables. I'll open a new ticket for that discussion. — Reply to this email directly, view it on GitHub <#5 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ALIA5CHSC4EXISIZMVHFBKDYYNFBXAVCNFSM6AAAAAAUVXKT3OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBQGMZTMOBZGI> . You are receiving this because you were mentioned.Message ID: ***@***.***>

spgarbet added the bug label Feb 23, 2023

spgarbet added enhancement question and removed bug enhancement labels Mar 15, 2023

spgarbet closed this as completed Mar 16, 2023

spgarbet pushed a commit that referenced this issue Mar 21, 2023

Merge pull request #5 from nutterb/Issue29-stringr

b02bb2e

Remove stringr dependency

spgarbet mentioned this issue Mar 21, 2023

missingSummary Function (Potential Issue with exportRecords?) nutterb/redcapAPI#9

Closed

spgarbet mentioned this issue Mar 15, 2024

Checkbox Missing Definition #349

Open

spgarbet mentioned this issue Jul 17, 2024

exportBulkRecords() Pulling forms that do not exist #404

Open

missingSummary Function (Potential Problem with exportRecords?) #5

missingSummary Function (Potential Problem with exportRecords?) #5

Comments

mcfibb commented Feb 8, 2023

mcfibb commented Feb 8, 2023

spgarbet commented Feb 23, 2023

mcfibb commented Feb 23, 2023

nutterb commented Feb 23, 2023

mcfibb commented Feb 23, 2023 • edited Loading

nutterb commented Feb 23, 2023

mcfibb commented Feb 23, 2023 • edited Loading

mcfibb commented Feb 27, 2023

spgarbet commented Feb 27, 2023

nutterb commented Feb 27, 2023

mcfibb commented Feb 27, 2023 • edited Loading

spgarbet commented Mar 2, 2023

mcfibb commented Mar 2, 2023

nutterb commented Mar 2, 2023

nutterb commented Mar 2, 2023

mcfibb commented Mar 2, 2023

mcfibb commented Mar 2, 2023

nutterb commented Mar 2, 2023

mcfibb commented Mar 2, 2023

mcfibb commented Mar 2, 2023 • edited Loading

spgarbet commented Mar 2, 2023

nutterb commented Mar 2, 2023

mcfibb commented Mar 3, 2023

nutterb commented Mar 3, 2023

mcfibb commented Mar 7, 2023

nutterb commented Mar 7, 2023

mcfibb commented Mar 7, 2023

mcfibb commented Mar 8, 2023

mcfibb commented Mar 8, 2023

nutterb commented Mar 9, 2023 via email

mcfibb commented Mar 9, 2023

nutterb commented Mar 9, 2023

spgarbet commented Mar 13, 2023

spgarbet commented Jan 4, 2024

mcfibb commented Mar 15, 2024

spgarbet commented Mar 15, 2024

mcfibb commented Mar 15, 2024 via email

mcfibb commented Feb 23, 2023 •

edited

Loading

mcfibb commented Feb 23, 2023 •

edited

Loading

mcfibb commented Feb 27, 2023 •

edited

Loading

mcfibb commented Mar 2, 2023 •

edited

Loading