A Public Framework for Testing Different Imputation Methods for Clinical Datasets

Project aim

This repository provides a framework for evaluating imputation methods at different percentages of missing values. The analysis has two parts: 1) an error analysis in terms of root mean squared error (RMSE) and absolute bias, and 2) a predictive model analysis in terms of the area under the curve (AUC) of a generalized linear model (GLM). Values are removed under two missingness mechanisms: 1) missing at random (MAR) and 2) missing completely at random (MCAR). The following imputation methods are compared:

mean imputation
median imputation
multiple imputation by chained equations (MICE)
hot-deck ("sampling")
expectation maximization (EM)
listwise deletion (for predictive model analysis only)
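
The difference between the two removal mechanisms, and how an imputation error can be measured afterwards, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy data, the column names `age` and `sbp`, and the helper names `remove_mcar`, `remove_mar`, `impute_mean` and `rmse` are hypothetical and not taken from the repository's code, and the MAR rule (removal probability grows with the rank of another parameter) is just one possible choice.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: two numerical parameters (names are illustrative).
n <- 200
toy <- data.frame(age = rnorm(n, mean = 60, sd = 10),
                  sbp = rnorm(n, mean = 130, sd = 15))

# MCAR: every value has the same probability of being removed.
remove_mcar <- function(x, frac) {
  idx <- sample(length(x), size = round(frac * length(x)))
  x[idx] <- NA
  x
}

# MAR: the removal probability depends on another, fully observed
# parameter (here: larger `driver` values are removed more often).
remove_mar <- function(x, driver, frac) {
  idx <- sample(length(x), size = round(frac * length(x)),
                prob = rank(driver))
  x[idx] <- NA
  x
}

# Mean imputation: replace every missing value by the observed mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# RMSE over the positions that were removed.
rmse <- function(truth, imputed, removed) {
  sqrt(mean((truth[removed] - imputed[removed])^2))
}

sbp_mcar <- remove_mcar(toy$sbp, 0.2)
sbp_mar  <- remove_mar(toy$sbp, toy$age, 0.2)

rmse_mcar <- rmse(toy$sbp, impute_mean(sbp_mcar), is.na(sbp_mcar))
rmse_mar  <- rmse(toy$sbp, impute_mean(sbp_mar), is.na(sbp_mar))
```

Swapping `mean` for `median` in `impute_mean` gives the corresponding sketch for median imputation.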

Folder structure

.
+-- code
+-- data
+-- imputations
|   +-- different-samples
|   +-- same-samples
+-- results
|   +-- error-analysis
|   |   +-- different-samples
|   |   +-- same-samples
|   +-- predictive-model
|   |   +-- across-parameters
|   |   +-- per-parameter
+-- figures
|   +-- error-analysis
|   |   +-- per-parameter
|   +-- predictive-model
|   |   +-- with-listwise
|   |   +-- without-listwise

All executable code can be found in the code folder and is further described in the code files section. The data folder contains the data to be analyzed.

The imputations folder stores all calculated imputations for the different imputation methods. The imputations are computed in two modes: 1) different samples and 2) same samples. With different samples, a different set of values is removed and imputed in each iteration for every imputation method; this mode is used to calculate the root mean squared error (RMSE). With same samples, the same set of values is removed and imputed in each iteration for every imputation method; this is required to determine the absolute bias of the imputation methods.
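
The two sampling modes can be illustrated with a small base R sketch. It assumes a simple hot-deck style imputation (random draws from the observed values) and assumes that the absolute bias is the absolute difference between the true value and the imputed value averaged over iterations; the repository may define either of these differently. All names (`impute_hot_deck`, `n_iter`, `n_missing`, `fixed_idx`) are hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

truth <- rnorm(100, mean = 50, sd = 5)  # hypothetical complete parameter
n_iter <- 20      # number of iterations
n_missing <- 10   # values removed per iteration

# Hot-deck style imputation: draw replacements from the observed values.
impute_hot_deck <- function(x) {
  miss <- is.na(x)
  x[miss] <- sample(x[!miss], size = sum(miss), replace = TRUE)
  x
}

# Different samples: new positions are removed in every iteration,
# and one RMSE is computed per iteration.
rmse_per_iter <- replicate(n_iter, {
  idx <- sample(length(truth), n_missing)
  x <- truth
  x[idx] <- NA
  imputed <- impute_hot_deck(x)
  sqrt(mean((imputed[idx] - truth[idx])^2))
})

# Same samples: the removed positions are fixed once, so each position
# collects one imputed value per iteration (rows: positions, cols: iterations).
fixed_idx <- sample(length(truth), n_missing)
imputed_fixed <- replicate(n_iter, {
  x <- truth
  x[fixed_idx] <- NA
  impute_hot_deck(x)[fixed_idx]
})

# Assumed definition: absolute bias per position is the distance between
# the mean imputed value over all iterations and the true value.
abs_bias <- abs(rowMeans(imputed_fixed) - truth[fixed_idx])
```

Fixing the positions is what makes the per-position average meaningful: with different samples, a given position is usually removed in only a few iterations, so there is little to average over.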

The results folder contains both the results of the error analysis (RMSE and absolute bias) and the results of the predictive model analysis (area under the curve, AUC). The predictive-model folder is split into across-parameters and per-parameter. Across parameters means that, for each percentage, values are removed across all parameters. Per parameter means that, for each percentage, values are removed from one parameter at a time.
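
A minimal base R sketch of the two removal schemes follows. It assumes that, across parameters, the percentage refers to all cells of the dataset, and that, per parameter, it refers to the rows of the chosen column; the repository may count differently. The toy data and the function names `remove_across` and `remove_per_parameter` are hypothetical.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical toy dataset with three numerical parameters.
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))

# Across parameters: remove a fraction of all cells, spread over every column.
remove_across <- function(df, frac) {
  m <- as.matrix(df)
  idx <- sample(length(m), size = round(frac * length(m)))
  m[idx] <- NA
  as.data.frame(m)
}

# Per parameter: remove a fraction of the values in one chosen column only.
remove_per_parameter <- function(df, column, frac) {
  idx <- sample(nrow(df), size = round(frac * nrow(df)))
  df[idx, column] <- NA
  df
}

toy_across <- remove_across(toy, 0.1)
toy_per_b  <- remove_per_parameter(toy, "b", 0.1)

colSums(is.na(toy_across))  # missing values appear in several columns
colSums(is.na(toy_per_b))   # missing values appear in column b only
```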

The figures folder stores all plots from the different analyses.

Code files

General code files:
- config.R: contains all configurations of both analyses.
- helper_functions.R: contains all helper functions for both analyses.
Error analysis:
- run_imputation_per_parameter.R: calculates imputations for every method and parameter and stores them.
- evaluate_error.R: computes the RMSE and absolute bias and stores them.
- plot_error_results.R: plots the error analysis results averaged over numerical and categorical data and stores the plots.
- plot_error_results_per_param.R: plots the error analysis results per parameter and stores the plots.
- run_error_statistical_analysis.R: calculates the p values comparing the RMSE of the different imputation methods.
Predictive model analysis:
- run_pred_model_across_parameters.R: computes the AUCs of a GLM fitted on imputed data, with missing values introduced across parameters, and stores the resulting AUCs.
- run_pred_model_per_parameter.R: computes the AUCs of a GLM fitted on imputed data, with missing values introduced for one parameter at a time, and stores the resulting AUCs.
- plot_perf_across_parameters.R: plots the AUCs and corresponding p values and saves the plot.
- plot_perf_per_parameter.R: plots the AUCs produced by run_pred_model_per_parameter.R.
- run_pred_model_statistical_analysis.R: runs a more detailed statistical analysis comparing the imputation methods with each other and with the full model (the model using the full dataset).
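
The idea behind the predictive model analysis, comparing the AUC of a GLM on the full data with the AUC of the same GLM on imputed data, can be sketched in base R as follows. This is not the repository's implementation: the toy data, the median imputation, the rank-based AUC helper and the in-sample evaluation (no train/test split) are assumptions made to keep the example short, and the names `auc` and `fit_auc` are hypothetical.

```r
set.seed(123)  # fixed seed so the sketch is reproducible

# Hypothetical toy data with a binary outcome that depends on x1 and x2.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$outcome <- rbinom(n, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))

# Remove 20% of x1 completely at random, then impute with the median.
removed <- sample(n, round(0.2 * n))
imputed <- toy
imputed$x1[removed] <- NA
imputed$x1[removed] <- median(imputed$x1, na.rm = TRUE)

# AUC via the rank (Mann-Whitney) formulation; labels must be 0/1.
auc <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fit a logistic GLM and return its in-sample AUC.
fit_auc <- function(df) {
  fit <- glm(outcome ~ x1 + x2, data = df, family = binomial())
  auc(df$outcome, predict(fit, type = "response"))
}

auc_full    <- fit_auc(toy)      # reference: model on the full dataset
auc_imputed <- fit_auc(imputed)  # model on the imputed dataset
```

Repeating the removal and imputation step for several percentages and methods would give one AUC per combination, which can then be compared with `auc_full`.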

How to run the analysis

General:

Put your data into the data folder and change the file path in config.R.
Change parameters (if necessary) in the config.R file.
Adapt the preprocess and definedatatypes functions in the helper_functions.R file according to your dataset. Decide which of your parameters are numerical and which are categorical and adapt this in the config.R file.

Error analysis:

Run run_imputation_per_parameter.R and evaluate_error.R for your numerical and categorical data to evaluate the error.
Run plot_error_results.R and run_error_statistical_analysis.R for plotting and the respective p values.
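
The error analysis steps above can also be chained from a single R session. The sketch below assumes that the working directory is the repository root, that the scripts live in the code folder, and that each script can be run on its own with `source()`; the helper name `run_scripts` is hypothetical and not part of the repository.

```r
# Error analysis scripts in the order given above.
error_scripts <- c("run_imputation_per_parameter.R",
                   "evaluate_error.R",
                   "plot_error_results.R",
                   "run_error_statistical_analysis.R")

# Hypothetical helper: check that all scripts exist, then source them in order.
run_scripts <- function(scripts, dir = "code") {
  paths <- file.path(dir, scripts)
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Script(s) not found: ", paste(not_found, collapse = ", "))
  }
  for (p in paths) {
    message("Running ", p)
    source(p)
  }
  invisible(paths)
}
```

Calling `run_scripts(error_scripts)` from the repository root would then run the whole error analysis; the same helper could be given the predictive model scripts instead.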

Predictive model analysis:

Decide on a parameter that should be predicted and change it in the respective predictive model files.
Run run_pred_model_across_parameters.R and run_pred_model_per_parameter.R for the AUC evaluation.
Run plot_perf_across_parameters.R and plot_perf_per_parameter.R to get the plots.
Run run_pred_model_statistical_analysis.R for the statistical analysis.

Repository: tabeak/missing-value-analysis