Suggested a fix in defining study code in the createStudyTable function #25

Merged
cmirzayi merged 3 commits into main from devel_ga on Mar 19, 2025
Conversation

@g-antonello
Contributor

In the previous implementation, the study code generated in createStudyTable did not account for the following edge cases:

  • Same first author name AND same year, but different publication
  • Special characters that the regex could not capture (case of PMID 34884399)

As a result, the study table did not have the expected number of rows (run the attached code example to verify).
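
To make the two edge cases concrete, here is a minimal base-R sketch on made-up data. The toy authors and PMIDs are invented. The suffixing scheme at the end is only one hypothetical way to disambiguate codes and is not necessarily what the `.make_unique_study_ID` helper in this PR does.

```r
# Toy data (invented): two signatures from one paper, a second paper by the
# same first author in the same year, and an author name with an apostrophe.
toy <- data.frame(
  Authors = c("Smith J, Doe A", "Smith J, Doe A", "Smith J, Roe B", "O'Brien K, Poe C"),
  Year    = c(2020, 2020, 2020, 2021),
  PMID    = c("111", "111", "222", "333"),
  stringsAsFactors = FALSE
)

# Old-style code: first run of ASCII letters followed by a space, plus the year.
first_author <- regmatches(toy$Authors, regexpr("[A-Za-z]+ ", toy$Authors, perl = TRUE))
old_code <- paste0(first_author, toy$Year)
old_code  # PMIDs 111 and 222 share a code; "O'Brien" is truncated to "Brien"

# Hypothetical alternative: one row per publication (keyed by PMID here),
# with a letter suffix only when author + year is ambiguous.
# Assumes at most 26 publications share a base code.
pubs <- unique(toy[, c("Authors", "Year", "PMID")])
base_code <- paste0(sub("[ ,].*$", "", pubs$Authors), "_", pubs$Year)
idx <- ave(seq_along(base_code), base_code, FUN = seq_along)  # position within code
n   <- ave(seq_along(base_code), base_code, FUN = length)     # publications per code
pubs$code <- ifelse(n > 1, paste0(base_code, letters[idx]), base_code)
pubs$code
```

With the old-style code the three publications collapse into two codes, which is the kind of row-count mismatch described above.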

Additionally, I have added an includeAlso parameter to the function, which lets the user include any other columns of the input data frame in the study table.
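
As a rough illustration of how such a parameter can be validated, the sketch below checks the requested columns against a toy data frame. The helper name `check_include_also` and the toy columns are hypothetical and only mirror the idea of the input check; they are not part of the package.

```r
# Hypothetical helper: stop with an informative message if any requested
# column is absent from the data frame; NULL means "nothing extra requested".
check_include_also <- function(df, includeAlso = NULL) {
  missing_cols <- setdiff(includeAlso, colnames(df))
  if (length(missing_cols) > 0) {
    stop(
      "The following columns are not found in the input data frame: ",
      paste(missing_cols, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# Toy input with one extra column available
toy <- data.frame(PMID = "111", `Body site` = "feces", check.names = FALSE)

# A column that exists passes silently
check_include_also(toy, includeAlso = "Body site")

# A column that does not exist is reported by name
msg <- tryCatch(
  check_include_also(toy, includeAlso = c("Body site", "Host species")),
  error = function(e) conditionMessage(e)
)
msg
```

Failing early with the missing names listed is friendlier than letting a later `all_of()` selection raise the error.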

library(dplyr)    # select(), group_by(), summarize(), reframe(), across(), relocate()
library(stringr)  # str_extract()

bsdb.df <- bugsigdbr::importBugSigDB(cache = FALSE)

# Some studies have NAs in PMID, URL or DOI, but not in all three
table(rowSums(is.na(select(bsdb.df, PMID, DOI, URL))), useNA = "always")

# unique studies based on their various links
unique_links <- unique(paste(bsdb.df$PMID, bsdb.df$DOI, bsdb.df$URL))
length(unique_links)


# old function as in v 0.99.5

createStudyTable_old <-function(dat){
  studies <- data.frame(Study=paste0(str_extract(dat$Authors, "[A-Za-z]+[:space:]"), dat$Year),
                        Condition=dat$Condition,
                        Cases=dat$`Group 1 sample size`,
                        Controls=dat$`Group 0 sample size`,
                        `Study Design`=dat$`Study design`)
  studies %>% group_by(Study) %>% summarize(Condition=first(Condition), 
                                            Cases=max(Cases),
                                            Controls=max(Controls), 
                                            `Study Design`=first(`Study.Design`))
  
}

# newly proposed function

createStudyTable_new <- function(bsdb.df, includeAlso = NULL) {
  # input check
  if (!is.null(includeAlso)) {
    if (!all(includeAlso %in% colnames(bsdb.df))) {
      stop(paste(
        "The following columns are not found in the input data frame:",
        paste(includeAlso[!(includeAlso %in% colnames(bsdb.df))], collapse = ", ")
      ))
    }
  }
  # The core of the change is how study IDs are generated; see
  # .make_unique_study_ID() in simple.R (an internal helper, not defined in
  # this snippet). NB: that function now also fixes DOI links as a side effect.
  bsdb_with_StudyCodes.df <- .make_unique_study_ID(bsdb.df)
  
  # some dplyr-fu to summarize tables, with more recent syntax
  study_table_fixed <- bsdb_with_StudyCodes.df %>%
    group_by(`Study code`) %>%
    reframe(
      MaxCases = max(`Group 1 sample size`),
      MaxControls = max(`Group 0 sample size`),
      across(
        all_of(
          c("Study design", "Condition", "PMID", "DOI", "URL", includeAlso)
        ),
        .fns = function(x)
          paste(unique(x), collapse = "; ")
      ),
      N_signatures = n()
    ) %>%
    relocate(N_signatures, .after = Condition)
  
  return(study_table_fixed)
}

study_table_old <- createStudyTable_old(bsdb.df)

study_table_new <- createStudyTable_new(bsdb.df)

dim(study_table_old)

dim(study_table_new)
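
For readers without the full BugSigDB download at hand, the per-study summary used in the new function can be imitated on toy data. This is a base-R sketch under assumptions: the study codes and values are invented, and `split()`/`lapply()` stand in for the dplyr pipeline.

```r
# Toy signatures: study A_2020 has two signatures, B_2021 has one with a
# missing sample size.
toy <- data.frame(
  `Study code` = c("A_2020", "A_2020", "B_2021"),
  Condition = c("obesity", "diabetes", "obesity"),
  `Group 1 sample size` = c(10, 25, NA),
  check.names = FALSE
)

# Same collapsing rule as in the proposed function: distinct values joined by "; "
collapse_unique <- function(x) paste(unique(x), collapse = "; ")

# One summary row per study code
by_study <- split(toy, toy$`Study code`)
summary_tab <- do.call(rbind, lapply(by_study, function(d) {
  data.frame(
    `Study code` = d$`Study code`[1],
    MaxCases = max(d$`Group 1 sample size`),  # NA if any size in the study is NA
    Condition = collapse_unique(d$Condition),
    N_signatures = nrow(d),
    check.names = FALSE
  )
}))
summary_tab
```

Note that `max()` returns NA when a study has any missing sample size; whether `na.rm = TRUE` is preferable there is a design choice left to the maintainers.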

@cmirzayi
Collaborator

I think this is a really cool improvement. Thank you. I figure we can merge it in.

@cmirzayi cmirzayi merged commit af4a6bb into main Mar 19, 2025