repDiagnosis

Repository for R package repDiagnosis and R shiny app for live replication diagnosis tools. This package accompanies the paper "Diagnosing the role of observable distribution shift in scientific replications" by Ying Jin, Kevin Guo and Dominik Rothenhäusler. [Reference]

Rshiny app. To quickly get started with our method, we recommend you visit our interactive Rshiny app and play with our preloaded clean datasets. Our Rshiny app is useful if you want to quickly diagnose your replication study pairs with easy-to-use features, such as live displaying and visualizing data, specifying your study, automatically checking regression in two studies, etc., or if you want to probe the generalizability of your single study (this function is not in our package and paper). [Link to Rshiny app]

Dataset collection. In a sibling Github repository awesome-replicability-data, we collect individual-level data for 11 original-replication study pairs from publicly available sources, and 5 one-sided replication studies. Visit our data collection to explore these datasets and our pre-processed clean versions.

Install R Package `repDiagnosis`

Install the devtools package using install.packages("devtools").
Install the latest development version using devtools::install_github("ying531/repDiagnosis").

Example

We walk through the replication study on EMDR and misinformation to show the use of our method, which decomposes the discrepancy between original and replication study estimates into several interpretable pieces (for details and cleaned datasets, see Data pair 3 in our dataset collection).

Load datasets. First, the paired datasets can be directly loaded from the package. Use data() to load them as org.dat for original study and rep.dat for replication study. You can also use your own datasets, but make sure their column names are consistent.

data("emdr_misinfo")

Specify regression. In this study, the regression of interest is totalcorrect on the treatment indicator named condition. Specify the regression formula as you do for linear models, and specify the treatment variable name.

analysis_formula = "totalcorrect ~ condition"
treatment_variable = "condition"

Specify covariates and mediators. Suppose you are interested in the heterogeneity between two studies captured by covariates age and gender. Also, you wonder whether the mediators postvividness and postemotionality capture any more heterogeneity. Specify these quantities as vectors.

covariates = c("age", "gender")
mediators = c("postvividness", "postemotionality")

Specify group IDs for clustered experiments. This is not needed for this dataset. If you like, you can also specify group IDs as X, which is a redundant variable that indicates the participant ID.

Run the decomposition. Finally, call our analysis function to run the diagnosis. It returns a summary table and a decomposition plot similar to our paper. By default, it also computes post-selective CIs that are valid at 1-alpha level after adjusting for the publication bias (i.e., only publishing p-values smaller than pub_pvalue_threshold whose default value is 0.05).

set.seed(1)
results = run_diagnosis(
  org_df = org.dat, # data frame for original study
  rep_df = rep.dat, # data frame for replication study 
  analysis_formula = analysis_formula, # regression formula 
  treatment_name = treatment_name, # column name of treatment indicator
  focal_variable = NULL, # focal variable, set as treatment_name if NULL
  covariates = covariates, # covariates to balance
  mediators = mediators, # mediators to balance 
  cluster_id = "X", # can also be set to NULL in this case
  selection_variable = NULL, # variable on which the selection happens, set  treatment_name if NULL
  alpha = 0.1, # confidence level for all CIs
  verbose = TRUE, # printing the tables and plots
  if_selective = TRUE, # conduct the selective inference CIs
  pub_pvalue_threshold = 0.05 # p-value threshold for publication bias
)

Display information. If verbose=TRUE, it will display the progress of jackknife for computing SEs, and print a message for how the selection adjustment is conducted. In this example, it displays:

Jackknifing [============================================================>] 100%

Message for selective inferece:
NOTE: Running selective inference conditional on p <= 0.05!

Table and plot. The object results contain a summary table of decomposition and a decomposition plot as in our paper, and their counterparts after adjusting for publication bias. It also returns the message for how the selection adjustment is conducted.

> results$table
                     Estimate Std. Error  t-stat Pr(>|z|)
Observed discrepancy  -1.1321     0.3361 -3.3681   0.0008
Covariates             0.1409     0.2315  0.6085   0.5428
Mediators              0.0150     0.0998  0.1500   0.8808
Residual              -1.2879     0.3984 -3.2328   0.0012

and the following decomposition plot:

Selection-adjusted table and plot. If if_selective=TRUE, it runs selective inference procedures. It will additionally display the following summary table:

> results$sel.table
                      CI.Low CI.High Estimate
Observed discrepancy -1.6850 -0.5427  -1.1321
Covariate shift      -0.2399  0.5216   0.1409
Mediation shift      -0.1492  0.1792   0.0150
Residual             -1.9433 -0.6188  -1.2831

and the following decomposition plot that adjusts for publication bias:

This plot does not differ much from the first because the original p-value is tiny, hence adjusting for publication bias does not change things too much.

Documentation

run_diagnosis <- function(
    org_df,
    rep_df,
    analysis_formula,
    treatment_variable, 
    focal_variable = NULL,  
    covariates = NULL,
    covariate_formula = NULL,
    mediators = NULL,
    mediation_formula = NULL,
    cluster_id = NULL,
    selection_variable = NULL,
    alpha = 0.1,
    verbose = TRUE,
    if_selective = TRUE,
    pub_pvalue_threshold = 0.05
)

This function takes two data.frame's for the original study and replication study as input, and conducts the replicability decomposition on the fitted coefficient for focal_variable in a linear regression with analysis_formula. If if_selective == True, then it conducts post-selection inference for the decomposition given that the p-value for selection_variable is smaller than pub_pvalue_threshold.

Below we describe in details how the arguments should be specified.

Arguments	Description
`org_df`	Dataframe for individual-level data in the original study
`rep_df`	Dataframe for individual-level data in the replication study; must match the format of `org_df` for features involved in the analysis
`analysis_formula`	The analysis formula for the regression, such as "outcome ~ treatment"
`treatment_variable`	The column name for the treatment indicator (1=treatment)
`focal_variable`	Optional, Optional, string for the column name of the variable on which the decomposition is conducted; if not specified, it will be set as `treatment_variable`
`covariates`	Optional, vector for column names or column number of covariates to balance
`covariate_formula`	Optional, formula for covariates to balance. - For instance, to balance `age` and `gender`, you can take it as `"~ age + gender"`, or `"~(age+gender) * treatment"` meaning that they are balanced in each arm; the two choices should not differ significantly. - If both `covariates` and `covariate_formula` are specified, covariate_formula will be taken. - In our implementation, taking `covariates = c("age", "gender")` is equivalent to `covariate_formula = "~ (age+gender) * treatment"`.
`mediators`	Optional, vector for column names or column number of mediators to balance
`mediation_formula`	Optional, formula for mediators to balance, which should also include covariates. - For instance, if you want to balance `my_mediator` on top of `age` and `gender`, the mediation formula should be `"~ (age + gender + mymediator) * treatment"`, which means "mediator" is viewed as a mediator to balance. - If both `mediators` and `mediation_formula` are specified, mediation_formula will be taken.
`cluster_id`	Optional, string for the column name of the cluster id in the studies (specify if the study is clustered)
`selection_variable`	Optional, string for the column name on which the publication bias happens; if not specified, it will be set the same as `focal_variable`
`alpha`	Optional, the confidence level for the confidence intervals, `= 0.1` by default
`verbse`	Optional, `= TRUE` by default; print the Jackknife progress and selective inference message if `verbose == TRUE`
`if_selective`	Optional, `= TRUE` by default; conduct post-selective inference for publication bias if `if_selecitve == True
`pub_pvalue_threshold`	Optional, `= 0.05` by default; the p-value threshold for publication bias adjustment. Selective inference will stop and return a message if the p-value in the original study for `focal_variable` is smaller than this value.

Below are the outputs in the returned list by the function.

Output	Description
`plot`	Decomposition plot without publication bias adjustment
`table`	Summary table without publication bias adjustment
`sel.plot`	Decomposition plot with publication bias adjustment; will be `NULL` if selective inference is not run or post-selection intervals are too wide (outside of estimate += 256 * SE)
`sel.table`	Summary table with publication bias adjustment; shows `-Inf` and `Inf` for confidence intervals if no meaningful CIs are found within estimate += 256 * SE
`sel.message`	Messages for selective inference, including how it is conducted and whether any CI is too wide; will be displayed if `verbose==TRUE`

Reference

Please use the following citation if you use this collection in your study, or you use our softwares for analyzing replication studies.

@article{jin2023diagnosing,
  title={Diagnosing the role of observable distribution shift in scientific replications},
  author={Jin, Ying and Guo, Kevin and Rothenh{\"a}usler, Dominik},
  journal={arXiv preprint arXiv:2309.01056},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
R		R
data		data
docs		docs
man		man
pkgdown		pkgdown
vignettes		vignettes
.DS_Store		.DS_Store
.Rbuildignore		.Rbuildignore
.Rhistory		.Rhistory
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
Rpackage.Rproj		Rpackage.Rproj
_pkgdown.yml		_pkgdown.yml

License

ying531/repDiagnosis

Folders and files

Latest commit

History

Repository files navigation

repDiagnosis

Install R Package repDiagnosis

Example

Documentation

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Install R Package `repDiagnosis`