tofaquih/imputation_of_untargeted_metabolites

Please cite as:

Tariq Faquih. (2022). tofaquih/imputation_of_untargeted_metabolites: Official Release v1.4 (v1.4). Zenodo. https://doi.org/10.5281/zenodo.6347808

Faquih T, van Smeden M, Luo J, le Cessie S, Kastenmüller G, Krumsiek J, Noordam R, van Heemst D, Rosendaal FR, van Hylckama Vlieg A, Willems van Dijk K, Mook-Kanamori DO. A Workflow for Missing Values Imputation of Untargeted Metabolomics Data. Metabolites. 2020 Nov 26;10(12):486. doi: 10.3390/metabo10120486. PMID: 33256233; PMCID: PMC7761057.

Introduction

A script for imputing missing values in Metabolon HD4 metabolomics datasets. Two imputation methods are available: MICE and kNN. It is technically possible to use the script with any metabolomics dataset.

Workflow:

  • The user provides two lists of metabolites: group1, to be imputed using MICE-pmm or kNN-obs-sel; and group2, to be imputed with zero.
  • The user must also provide the variables to be used in the analysis after the imputation, including the outcome.
  • A correlation matrix is created for all the group1 metabolites.
  • The group1 metabolites are split into complete cases metabolites (ccm) and incomplete cases metabolites (icm).
  • For each icm, the 10 ccm with the highest absolute correlation (R) are selected. These are used to impute the missing values in the icm.
    • An icm is treated as invalid if it has more than 90% missing values OR if its number of non-missing values is less than the number of predictor variables + 20.
    • This is done for two reasons:
      • to eliminate possibly mis-annotated metabolites or unannotated metabolites that are xenobiotic in nature;
      • to ensure that enough cases are available to perform the imputation. Too few cases prevent the MICE package from performing the imputation altogether.
    • The invalid icm are imputed with zero.
  • The results are returned as three objects: the imputed data, the summary of the imputation, and the mean R of the ccm used for each icm.
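
The selection steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code inside UnMetImp: the toy data, the helper name `select_predictors`, and the use of pairwise-complete Pearson correlation are all choices made for this example only.

```r
# Toy data: 6 metabolites, 100 samples (all names are hypothetical).
set.seed(1)
n <- 100
met <- as.data.frame(matrix(rnorm(n * 6), nrow = n,
                            dimnames = list(NULL, paste0("m", 1:6))))
met$m5 <- met$m1 + rnorm(n, sd = 0.1)   # m5 correlates strongly with m1
met$m5[1:30] <- NA                      # 30% missing: a valid icm
met$m6[1:95] <- NA                      # 95% missing: an invalid icm

# Split into complete (ccm) and incomplete (icm) cases metabolites.
n_missing <- colSums(is.na(met))
ccm <- names(met)[n_missing == 0]
icm <- names(met)[n_missing > 0]

# Hypothetical helper: return up to max_n ccm with the highest |R| for one
# icm, or NULL when the icm fails the validity rule described above.
select_predictors <- function(data, target, ccm, max_n = 10) {
  observed <- sum(!is.na(data[[target]]))
  n_pred <- min(max_n, length(ccm))
  too_sparse <- mean(is.na(data[[target]])) > 0.9
  too_few <- observed < n_pred + 20
  if (too_sparse || too_few) return(NULL)
  r <- cor(data[ccm], data[[target]], use = "pairwise.complete.obs")[, 1]
  names(sort(abs(r), decreasing = TRUE))[seq_len(n_pred)]
}

# One entry per icm; NULL marks an icm that would be imputed with zero.
predictors <- lapply(setNames(icm, icm), select_predictors,
                     data = met, ccm = ccm)
```

In this toy setup `m6` fails the 90% rule, so its entry in `predictors` is NULL, while `m5` receives a ranked list of ccm names.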

How to use:

  1. Import the script: source('UnMetImp.R')
  2. Create vectors with the column names of the group1 (endogenous and/or unannotated metabolites) and group2 (xenobiotics) metabolites.
  3. Use the UnMetImp function.
  • Usage
    • UnMetImp(DataFrame , imp_type = 'mice' , number_m = 5 , group1 , group2 = NULL , outcome=NULL, covars=NULL, fileoutname = NULL , use_covars = FALSE , logScale = TRUE , covars_only_mode = FALSE , maxN_input = 10)
  • Arguments
    • DataFrame: The full data frame to be used, containing all the metabolites, covariables and the outcome. Must be a data frame with numeric columns.
    • imp_type: String. Type of imputation to be used: mice or knn. Default is mice.
    • number_m: Numeric. For imp_type == "mice" only. Number of imputations to be used. Default = 5.
    • group1: Vector. Required. Vector with the names of metabolite columns. Will be imputed using the provided imp_type.
    • group2: Vector. Optional. Vector with the names of metabolite columns. Will be imputed to zero.
    • outcome: String. Required. The outcome variable to be used in the future analysis.
    • covars: Vector. Recommended. Variables used in the future analysis. Will be returned with the imputed data.
    • fileoutname: String value. Optional. Saves the imputed output to a file.
    • use_covars: Logical. Optional. Whether the covars will be used to impute the missing values in the metabolites. Default = FALSE.
    • logScale: Logical. Optional. Whether the values need to be log-transformed and scaled for the imputation. If TRUE, the values are log-transformed and scaled, then un-logged and unscaled before the imputed output is returned. If FALSE, the script assumes the values have already been log-transformed. Default = TRUE.
    • covars_only_mode: Logical. Optional. If TRUE, only the covariables are used for the imputation and all other metabolites are ignored. Useful in case of collinear/constant variables. Only works with mice imputation. Default = FALSE.
    • maxN_input: Numeric. Optional. Maximum number of ccm to be used for the imputation. Overridden if covars_only_mode == TRUE. Useful in case of collinear/constant variables. Default = 10.
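
The logScale = TRUE behaviour described above (transform before imputation, reverse afterwards) can be illustrated as a round trip in base R. This is a sketch under assumptions: natural log and z-score scaling via scale() are assumed here, since the text does not state which log base or scaling method the script uses, and the data are made up.

```r
# Hypothetical raw intensities for two metabolites (strictly positive).
raw <- data.frame(met_a = c(120, 340, 560, 910),
                  met_b = c(15, 22, 48, 31))

# Forward: natural log, then centre and scale each column.
logged  <- log(as.matrix(raw))
scaled  <- scale(logged)
centers <- attr(scaled, "scaled:center")
scales  <- attr(scaled, "scaled:scale")

# ... the imputation would run on `scaled` at this point ...

# Backward: undo the scaling column by column, then undo the log.
unscaled <- sweep(sweep(scaled, 2, scales, `*`), 2, centers, `+`)
restored <- as.data.frame(exp(unscaled))
```

Keeping the centring and scaling constants from the forward step is what makes the back-transformation exact for the observed values.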

Examples of running the scripts:

Please try out the script using the provided NEO metabolomics data (with simulated characteristics variables) in the Binder Docker environment. The examples below use the following objects:

  • mydata: your data table in dataframe class format.
  • endoids: a user created vector containing the column names of the endogenous metabolites
  • unknowns: a user created vector containing the column names of the unannotated metabolites
  • xeon: a user created vector containing the column names of the xenobiotic metabolites
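
One possible way to build these vectors is from a metabolite annotation table. The sketch below assumes a hypothetical data frame `annotation` with columns `column_name` and `super_pathway`, and hypothetical pathway labels "Xenobiotics" and "Unknown". None of these names come from this repository, so adapt them to your own annotation file.

```r
# Hypothetical annotation table: one row per metabolite column in mydata.
annotation <- data.frame(
  column_name   = c("M100", "M101", "M102", "X200", "X201"),
  super_pathway = c("Lipid", "Amino Acid", "Unknown", "Xenobiotics", "Unknown"),
  stringsAsFactors = FALSE
)

# Split the column names into the three vectors used in the examples.
xeon     <- annotation$column_name[annotation$super_pathway == "Xenobiotics"]
unknowns <- annotation$column_name[annotation$super_pathway == "Unknown"]
endoids  <- annotation$column_name[
  !annotation$super_pathway %in% c("Xenobiotics", "Unknown")
]

# Sanity check: group1 and group2 must not share any column.
group1 <- c(endoids, unknowns)
stopifnot(length(intersect(group1, xeon)) == 0)
```

It is also worth checking that every name in `group1` and `xeon` exists as a column of `mydata` before calling UnMetImp.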

Running default knn:

source('Master_Script.r')

knnimp <- UnMetImp(DataFrame = mydata, group1 = c(endoids, unknowns), group2 = xeon, covars = c('age', 'sex'), imp_type = 'knn', outcome = c('BMI'), logScale = TRUE)

Running default MICE:

source('Master_Script.r')

miceimp <- UnMetImp(DataFrame = mydata, group1 = c(endoids, unknowns), group2 = xeon, covars = c('age', 'sex'), imp_type = 'mice', number_m = 5, outcome = c('BMI'), use_covars = FALSE, logScale = TRUE)

The mice package stores the output of the imputation step in an object of class mids by default. This object holds information about the imputation process and the imputed datasets. with() takes a mids object as input and runs the analysis on each imputed dataset; pool() then combines the resulting estimates and standard errors using Rubin's Rules. To convert the mids object to a "long" format:

require('mice')

IMP <- miceimp$mids

Longformat <- complete(IMP, action = 'long', include = TRUE)

To convert the Longformat back to mids class:

IMP <- as.mids(Longformat)

To run the analysis on the mids class and pool the estimates by Rubin's Rules:

Model_Formula = as.formula('BMI~age+sex+...')

Mysummary <- summary(pool( with(data = IMP, expr = lm( formula = Model_Formula ) )) ,conf.int = TRUE)
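
For readers unfamiliar with Rubin's Rules, the pooling of a single coefficient can be written out by hand. The sketch below uses made-up estimates and standard errors from m = 5 imputed datasets (the numbers are illustrative, not results from this workflow). It applies the standard formulas: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```r
# Made-up per-dataset results for one regression coefficient (m = 5).
estimates  <- c(0.52, 0.48, 0.55, 0.50, 0.45)
std_errors <- c(0.10, 0.11, 0.09, 0.10, 0.12)
m <- length(estimates)

pooled_estimate <- mean(estimates)       # average of the m estimates
within_var      <- mean(std_errors^2)    # average squared standard error
between_var     <- var(estimates)        # variance across datasets (m - 1 denominator)
total_var       <- within_var + (1 + 1 / m) * between_var
pooled_se       <- sqrt(total_var)
```

pool() additionally derives degrees of freedom for the confidence intervals, which this sketch leaves out.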
