Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missingSummary Function (Potential Problem with exportRecords?) #5

Closed
mcfibb opened this issue Feb 8, 2023 · 37 comments
Closed

missingSummary Function (Potential Problem with exportRecords?) #5

mcfibb opened this issue Feb 8, 2023 · 37 comments
Labels

Comments

@mcfibb
Copy link

mcfibb commented Feb 8, 2023

Hi there -- so appreciate this library! Hoping this is the right place to ask this:

On first glance, the missingSummary ancillary function for redcapAPI works perfectly. However, when cross-checking true missingness with REDcap open, there are a number of branching logic values where conditions are met, missingness still occurs, but the code doesn't report the missing value. Is it possible this is related to one of the options in the exportRecords function being the wrong value or perhaps the function isn't pulling in all available data?

Here's a link to the missingSummary gist: https://gist.github.com/nutterb/501c370418abb58bee78

@mcfibb
Copy link
Author

mcfibb commented Feb 8, 2023

EDIT: Same result with the missingSummary_offline function.

@spgarbet spgarbet added the bug label Feb 23, 2023
@spgarbet
Copy link
Member

This is an interesting case. Essentially what this says is that the branching logic parsing in the library does not match what's going on in REDCap. If you have the time, please outline the branching logic in REDCap and I'll see if I can reproduce.

@mcfibb
Copy link
Author

mcfibb commented Feb 23, 2023

Happy to send attach some example data + the data dictionary. It's strange because I'm using the API pull for both the data and the dictionary in the initial call.

I'll update this in an hour or so with the the example data.

@nutterb
Copy link
Collaborator

nutterb commented Feb 23, 2023

Has anything changed in how the branching logic is written to the metadata in the past...I don't know how long it's been since this was written. I would think there's a nontrivial probability that parseBranchingLogic is out of date.

@mcfibb
Copy link
Author

mcfibb commented Feb 23, 2023

Oh, interesting. I have no clue. Here's the data and dictionary.

For the data, I've given just one case, but across a few different forms. I caught the missingness error on the branching logic for variables "ser_amino" and "amino_acids_res", the latter of which has a dependency on the former being equal to "1". In the example data, "ser_amino==1", so "amino_acids_res" was shown, but there was a blank response. In theory, missingSummary should catch this, but it doesn't. It catches other missingness, but not this one.

meta.csv
REDcap API Export.csv

@nutterb
Copy link
Collaborator

nutterb commented Feb 23, 2023

Have you done anything to this metadata file?

I ask because the column order in the file you attached is nothing like the column order the function is assuming.

@mcfibb
Copy link
Author

mcfibb commented Feb 23, 2023

Ah, the metafile a gave you was from a different library (me trying to circumvent the issues here while waiting to hear back). Here's the actual data dictionary, which I pulled using the following code in R:

meta_data <- exportMetaData(rcon)
write_excel_csv(meta_data,"~Downloads/meta.csv")

meta.csv

@mcfibb
Copy link
Author

mcfibb commented Feb 27, 2023

Hi all -- just following up on this. Any ideas?

@spgarbet
Copy link
Member

I'm hoping @nutterb has an idea. I have a test framework setup for issues like this (working Issue #6). If I can reproduce this and understand the root cause then we can turn it around.

@nutterb
Copy link
Collaborator

nutterb commented Feb 27, 2023

It doesn't look like processing the branching logic has changed; I think I can rule that out.

I'm afraid I haven't made much progress on this otherwise. I can't get missingSummary_offline to run without encountering an error. I'm running into problems where the code can't find variables it seems to want to find. As far as I can tell, any variable name that would exceed 32 characters is getting truncated. So "first_tx_other_medications_tx___99" is showing up at "first_tx_other_medications_tx___" in the data set. Which is a problem because it's the same variable name for every one of those checkbox options.

The form complete variables are also kicking up trouble. For instance ""seizuresepilepsy_history_complete" is appearing in the data set at ""seizuresepilepsy_history_complet". Again, exactly 32 characters (which I think is a SAS or SPSS limit). I can't say if that has anything to do with the failure to catch certain missingness (I doubt it), but it is making it difficult to replicate the finding.

@mcfibb
Copy link
Author

mcfibb commented Feb 27, 2023

Yeah -- I feel you there! The 32 character limit is true too for Stata. I ended up needing to recode variables on export via Stata & R to remove characters from the middle, preserving the beginning and ending characters. I inherited the database, so I'm in the trenches banging my head against a wall right along with you. :)

Here's that "import" code in R (ignore the "///". I ran the code using RCALL in Stata):

                                library(devtools);			///
				source("$Code/501c370418abb58bee78-4a0546caee0928f9ad91ad850cc751eb2d5d3675/missingSummary.R");	///
				library(redcapAPI);			///
				library(stringr);			///
				library(tidyr);				///
				library(writexl);	///
				options(redcap_api_url = "https://XXX);	///
				rcon <- redcapConnection(token = "XXX");		///
				records <- exportRecords(rcon, factors=FALSE, labels=FALSE,	///
                           dates=FALSE, survey=FALSE, dag=TRUE,	///
                           batch.size=-1);	///
				max_length <- 30;	///
				records_shortened <- records;	///
				for (col in colnames(records)) {;	///
				  if (nchar(col) >= max_length) {;	///
					if (grepl("medication", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("medication", "med", col, ignore.case = TRUE);	///
					} else if (grepl("seizure", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("seizure", "seiz", col, ignore.case = TRUE);	///
					} else if (grepl("history", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("history", "hx", col, ignore.case = TRUE);	///
					} else if (grepl("abnormalities", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("abnormalities", "abnorm", col, ignore.case = TRUE);	///
					} else if (grepl("malforma", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("malforma", "mal", col, ignore.case = TRUE);	///
					} else if (grepl("etiology", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("etiology", "eti", col, ignore.case = TRUE);	///
					} else if (grepl("structural", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("structural", "str", col, ignore.case = TRUE);	///
					} else if (grepl("second", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("seoncd", "sec", col, ignore.case = TRUE);	///
					} else if (grepl("classifica", col, ignore.case = TRUE)) {;	///
					  new_col <- gsub("classifica", "class", col, ignore.case = TRUE);	///
					} else {;	///
					  new_col <- substring(col, 1, max_length);	///
					};	///
					colnames(records_shortened)[colnames(records_shortened) == col] <- new_col;	///
				  };	///
				};	///
				write_xlsx(records_shortened,"XXX/REDcap Test.xlsx");

@spgarbet
Copy link
Member

spgarbet commented Mar 2, 2023

What is the simplest reproducible example of this behavior?

@mcfibb
Copy link
Author

mcfibb commented Mar 2, 2023

Alrighty. Simplest approach:

Download these two files:
data_dictionary.csv
redcap_data.csv

Then, run the following R code:

source("[Directory Path Here]/missingSummary.R")
library(redcapAPI)
Miss <- missingSummary_offline("[Directory Path Here]/redcap_data.csv","[Directory Path Here]/data_dictionary.csv")
Miss

The output will be this:

record_id redcap_repeat_instrument n_missing      missing
1    123456                              0             
2    123456                      eeg         1  based_score
3    123456                      eeg         1  based_score
4    123456                  imaging         1 mri_uploaded

However, I know it's not catching everything because when spot-checking, I know that when "ser_amino_acids" equals 1, then there should be a response expected in "amino_acids_res". In the data, amino_acids_res=="" and therefore should show up in the Miss report, but it isn't there. The data dictionary line for "amino_acids_res" is:

amino_acids_res | genetics_metabolics | radio | Amino acids | 1, Normal \| 2,   Abnormal \| 3, Uncertain [ser_amino_acids]   = '1' |   |   | metab_results |

Anything else I can clarify?

@nutterb
Copy link
Collaborator

nutterb commented Mar 2, 2023

This is going to be long, but I'm not sure how to describe what's happening very succinctly. TLDR: the function has a pretty serious flaw. I made a data set with the following when I export it from REDCap.

> records
  record_id demographics_complete ser_amino_acids amino_acids_res genetics_metabolics_complete
1         1                     0               1              NA                            0

What we want to check is whether amino_acids is missing when ser_amino_acids = 1. When we parse the branching logic, we parse the branching logic, we come up with expression(ser_amino_acids == "1"). This is correct.

BUT (see all that formatting? this must be important), missingSummary iterates through all of the columns from left to right to look for missing values, and replaces the values in records with a logical flag indicating if the value is missing. Seems harmless enough, but look at what records becomes after we check ser_amino_acids for missing values.

  record_id demographics_complete ser_amino_acids amino_acids_res genetics_metabolics_complete
1         1                 FALSE           FALSE              NA                            0

Now when we compare ser_amino_acids == 1, we're going to get a FALSE, and we're not actually going to look for missingness in amino_acids. The terrifying thing about this is that no variable downstream of branching logic is going to be checked properly. This has never worked as intended (not with branching logic, anyway).

The offending part of the code is https://gist.github.com/nutterb/501c370418abb58bee78#file-missingsummary-r-L76-L99, where we are overwriting records. I believe it should be (for lines 87-97)...this will also require making a copy of records (records_orig <- records immediately following the exportRecords call)

 if (tmp_form == "_complete") records[[i]] <- FALSE
      if (!tmp_form %in% names(records_orig))
        records[[i]] <- is.na(records_orig[[i]])
      else if (!is.expression(l))
        records[[i]] <- ifelse(is.na(records_orig[[tmp_form]]), 
                               FALSE, is.na(records_orig[[i]]))
      else
        records[[i]] <- ifelse(is.na(records_orig[[tmp_form]]),
                               FALSE,
                               ifelse(with(records_orig, eval(l)), is.na(records_orig[[i]]),
                                      FALSE))  

I'll post a correction in a few minutes.

@nutterb
Copy link
Collaborator

nutterb commented Mar 2, 2023

there's a correction up. @spgarbet, if you'd like, I can work up a cleaner version of this, since you've expressed interest in including it in the package.

@mcfibb
Copy link
Author

mcfibb commented Mar 2, 2023

Wow, you're awesome. And your explanation makes perfect sense. Thank you for all the help. I'll give this a shot and let you know what comes back.

@mcfibb
Copy link
Author

mcfibb commented Mar 2, 2023

Getting closer: Here's the output now:

 record_id redcap_repeat_instrument n_missing                                                                                                                                        missing
1    123456                             NA NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, outcome_2y_episurg_yn, NA, NA, amino_acids_res, acylcarnitines_res, NA, NA
2    123456                      eeg         1                                                                                                                                    based_score
3    123456                      eeg         1                                                                                                                                    based_score
4    123456                  imaging         1                                                                                                                                   mri_uploaded

Any idea why the NA is being spit out? Is that due to checkboxed values? (i.e. race = race____1, race____2, etc)?

@nutterb
Copy link
Collaborator

nutterb commented Mar 2, 2023

I don't think so. I just ran it against a database with the full data dictionary. I'm not getting that. I also just posted another update, so maybe I inadvertently fixed something, but I doubt it.

I'm wondering if the lack of a value in redcap_repeat_instrument is related. I don't have repeating instruments set up on my project right now. Give the new posting a try and let's see how it goes.

@mcfibb
Copy link
Author

mcfibb commented Mar 2, 2023

Thanks -- I'll give it a shot.

@mcfibb
Copy link
Author

mcfibb commented Mar 2, 2023

New error:

Error in .missingSummary_isMissingInField(records_orig, meta_data, logic) : 
  object 'records_orig' not found
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on '/Users/xxxx/Downloads/redcap_data.csv'
Called from: .missingSummary_isMissingInField(records_orig, meta_data, logic)

Seems to stem from L150-152.

@spgarbet
Copy link
Member

spgarbet commented Mar 2, 2023

I'd love to include it in the package. We'd need a good test case for regression testing on updates.

@nutterb
Copy link
Collaborator

nutterb commented Mar 2, 2023

@mcfibb, Ah, missed declaring records_orig in the offline version. I'm guessing that's the one you used? I've properly tested that one now, too.

@mcfibb
Copy link
Author

mcfibb commented Mar 3, 2023

IT WORKS! @nutterb -- thank you so, so much.

@nutterb
Copy link
Collaborator

nutterb commented Mar 3, 2023

723rd time is a charm!

@mcfibb
Copy link
Author

mcfibb commented Mar 7, 2023

Looks like we'll need to make that 724 times. Something isn't right: in crosschecking missingness, now new non-missing are being counted as missing.

To replicate, download the following files:
data.csv
dictionary.csv

Then run missingSummary_offline. You'll see there's 121 missing items, including "amino_acids_res". While true, you'll see that the "gatekeeper" variable of "ser_amino_acirds" is NOT "1", and therefore, "amino_acids_res" should not be considered missing. (Sidenote: I know we keep using these variables, but it's more of a convenience thing than anything else. It's just emblematic of the larger issue.)

Ideas?

@nutterb
Copy link
Collaborator

nutterb commented Mar 7, 2023

I made the rather absurd assumption that the parsed logic list would have the same length as records. checkbox variables blow that assumption out of the water. I've updated the function to reference the parsed logic by field name. There are also some small improvements in counting the missing variables that avoid implicitly turning data frames into matrices.

Seems to be working well on your data example and the very small data example I'm building out to test this more robustly. Let me know what blows up when you generalize it over your larger data.

@mcfibb
Copy link
Author

mcfibb commented Mar 7, 2023

Thanks a ton! Will try out later today. :)

@mcfibb
Copy link
Author

mcfibb commented Mar 8, 2023

Seems to work so far. Will keep you posted!

@mcfibb
Copy link
Author

mcfibb commented Mar 8, 2023

Alright -- back again with another episode of "WTF: It's Not Just Me, Right?"

Before I get into detail, here's the bigger question: If a form is marked "Incomplete", will missingness show up in this report or is that form excluded from the summary code?

@nutterb
Copy link
Collaborator

nutterb commented Mar 9, 2023 via email

@mcfibb
Copy link
Author

mcfibb commented Mar 9, 2023

It does (repeated forms), but it turns out that isn't the issue, and neither is the "Incomplete" thing either.

Checkboxes. Why did it have to be checkboxes...

They are marked "0" at default, so of course they'd never show up as missing. So, I'm trying to figure out how to parse the file such that if all subvariables associated with that checkbox variable "=0", then include the checkbox primary variable as "missing".

I wrote some code on my own for that in Stata (which I use in tandem with R using Rcall), but I'm not quite sure how I'd combine it with the results of missingSummary. I'm guessing checkbox missingness isn't an included functionality at this point?

@nutterb
Copy link
Collaborator

nutterb commented Mar 9, 2023

I haven't ever considered checkbox missingness, mostly because whether all of them being unchecked is an indication of 'missing' is definitely situational. I'm inclined to think this would be easier to extract with a calculated field (see below for an example).

image

I wrote some code on my own for that in Stata (which I use in tandem with R using Rcall), but I'm not quite sure how I'd combine it with the results of missingSummary. I'm guessing checkbox missingness isn't an included functionality at this point?

I'm not familiar with Stata enough to talk about joins, but if you have your missingness value tied to the record id (and event/form IDs), a left join would be the approach I would take.

@spgarbet
Copy link
Member

This code is now published on the main branch as part of the upcoming 2.4.2 release.

@spgarbet
Copy link
Member

spgarbet commented Jan 4, 2024

@mcfibb would you reach out to me at my email < shawn.garbett (at) vumc.org > ?

@mcfibb
Copy link
Author

mcfibb commented Mar 15, 2024

Hi everyone -- long time, no chat!

So, I'm back to my old REDcap data quality pulls, and I'm making good use of the missingSummary feature. I haven't updated the library because I don't want to break anything that I programmed a year ago, but I am running into some issues from where I left off vis a vis checkbox missingness.

I did indeed write my own "side-chained" code to merge missingSummary data with my own checkbox missingness code, but I made the (foolish) assumption in my own code that parsing logic wouldn't apply to checkboxes... which of course is not the case. (A visible checklist may trigger an invisible checklist). Turns out, this happens a lot in my data, so I've been producing quality reports that show missingness for checkbox data (i.e. all 0's) when in fact they've never been shown at all, so it would make sense they all have zeros. But my workaround hasn't considered all of the parsing logic for each checkbox.

So, with that long-winded context out of the way: has there been any progress on identifying checkbox missingness using missingSummary? (Based on the following logic: 1) the checkbox field passes the logic and was shown, and 2) all values of the checkbox field sum to 0 (e.g. nothing was checked))

Any help would be appreciated!

@spgarbet
Copy link
Member

We've addressed a lot of issues recently around checkboxes but not that one. missingSummary has been revised a fair bit as well and there is a pull request awaiting.

One issue with the design of a checkbox is that by definition they always have a value. I.e. by their nature and definition they are never missing (this causes weird issues with detecting emptyRows). Your proposal is for a group of checkboxes that they are considered missing or empty when the entire group is unchecked. I think that could be accomplished but it would be a choice on the part of the user, i.e. it's an external definition of missing that is not natural to the domain. That said, I can see the usefulness of such a proposal. I am reaching trying to think how it could be defined via branch logic or smart variables. I'll open a new ticket for that discussion.

@mcfibb
Copy link
Author

mcfibb commented Mar 15, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants