# install.packages("devtools")
devtools::install_github("shengqh/DupChecker")
or from Bioconductor with the following code:
source("http://bioconductor.org/biocLite.R")
biocLite("DupChecker")
The following shows the most basic steps of a validation procedure. You need a target directory in which to store the data; here we assume it is your working directory.
library(DupChecker)
geoDownload(datasets = c("GSE14333", "GSE13067", "GSE17538"), targetDir=getwd())
datafile<-buildFileTable(rootDir=getwd(), filePattern="cel$")
result<-validateFile(datafile)
if(result$hasdup){
duptable<-result$duptable
write.csv(duptable, file="duptable.csv")
}
If downloading or decompressing takes too long within R, you can download the raw GEO/ArrayExpress data and decompress it into individual data files with other tools. DupChecker expects uncompressed CEL files because compressed copies of the same CEL file produced by different compression software may have different MD5 fingerprints.
The following code downloads two datasets from ArrayExpress and three datasets from GEO. It may take from a few minutes to a few hours, depending on your network performance.
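As a minimal sketch of decompressing outside DupChecker, the base R code below unpacks a single `.gz` file and checks that the restored file has the same MD5 fingerprint as the original content. The helper name `gunzipFile` and the file names are hypothetical and are not part of DupChecker; the demo uses a small throwaway text file in place of a real CEL file.

```r
# Hypothetical helper (not part of DupChecker): decompress one .gz file with base R.
gunzipFile <- function(gzPath,
                       outPath = sub("\\.gz$", "", gzPath, ignore.case = TRUE)) {
  stopifnot(file.exists(gzPath), outPath != gzPath)
  input <- gzfile(gzPath, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(outPath, open = "wb")
  on.exit(close(output), add = TRUE)
  # Copy the decompressed bytes in 1 MB chunks.
  repeat {
    chunk <- readBin(input, what = "raw", n = 1024 * 1024)
    if (length(chunk) == 0) break
    writeBin(chunk, output)
  }
  invisible(outPath)
}

# Demonstration on a throwaway file standing in for a CEL file.
demoDir <- tempfile("gunzip_demo")
dir.create(demoDir)
original <- file.path(demoDir, "original.txt")
writeLines(c("placeholder", "content"), original)

# Write a gzip-compressed copy of the original bytes.
gzPath <- file.path(demoDir, "sample.CEL.gz")
gzCon <- gzfile(gzPath, open = "wb")
writeBin(readBin(original, what = "raw", n = file.size(original)), gzCon)
close(gzCon)

restored <- gunzipFile(gzPath)
# Compare fingerprints of the decompressed copy and the original content.
sameFingerprint <- unname(tools::md5sum(restored)) == unname(tools::md5sum(original))
sameFingerprint
```

Comparing fingerprints after decompression, rather than on the `.gz` files themselves, is the point of the sketch: the compressed container can differ between tools even when the content is identical.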
library(DupChecker)
#download from ArrayExpress system
datatable<-arrayExpressDownload(datasets = c("E-TABM-158", "E-TABM-43"), targetDir=getwd())
datatable
#Or download from GEO system
datatable<-geoDownload(datasets = c("GSE14333", "GSE13067", "GSE17538"), targetDir=getwd())
datatable
The returned datatable is a data frame containing each dataset name and the number of CEL files in that dataset.
##Build file table
Next, the function buildFileTable finds all files matching filePattern in the subdirectories of the root directories you provide. The resulting data frame contains two columns, dataset and filename. rootDir can also be a vector of directories.
datafile<-buildFileTable(rootDir=getwd(), filePattern="cel$")
datafile
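To make the shape of such a file table concrete, here is a rough base R sketch that builds a comparable two-column data frame from a toy directory layout. The function `sketchFileTable` is a hypothetical stand-in, not DupChecker's implementation; taking the dataset name from the parent folder and matching the pattern case-insensitively are both assumptions made for illustration.

```r
# Hypothetical stand-in for illustration; DupChecker's buildFileTable may differ.
sketchFileTable <- function(rootDir, filePattern = "cel$") {
  # Search every root directory recursively; ignoring case is an assumption here.
  files <- list.files(rootDir, pattern = filePattern, recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  # Assume each file sits in a folder named after its dataset.
  data.frame(dataset = basename(dirname(files)),
             filename = files,
             stringsAsFactors = FALSE)
}

# Toy layout: two dataset folders under a temporary root (names are made up).
root <- tempfile("filetable_demo")
for (dataset in c("setA", "setB")) {
  dir.create(file.path(root, dataset), recursive = TRUE)
}
writeLines("a", file.path(root, "setA", "s1.CEL"))
writeLines("b", file.path(root, "setB", "s2.CEL"))
writeLines("note", file.path(root, "setB", "readme.txt"))

fileTable <- sketchFileTable(rootDir = root)
fileTable
```

Because `list.files` accepts a character vector of paths, the same sketch also works when `rootDir` holds several directories.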
##Validate file redundancy
The function validateFile calculates the MD5 fingerprint of each file in the table and then checks whether any two files share the same fingerprint. Files with the same fingerprint are treated as duplicates. The function returns a table containing all duplicated files and their datasets.
result<-validateFile(datafile)
if(result$hasdup){
duptable<-result$duptable
write.csv(duptable, file="duptable.csv")
}
MD5 | GSE13067(64/74) | GSE14333(231/290) | GSE17538(167/244) |
---|---|---|---|
001ddd757f185561c9ff9b4e95563372 | | GSM358397.CEL | GSM437169.CEL |
00b2e2290a924fc2d67b40c097687404 | | GSM358503.CEL | GSM437210.CEL |
012ed9083b8f1b2ae828af44dbab29f0 | GSM327335 | GSM358620.CEL | |
023c4e4f9ebfc09b838a22f2a7bdaa59 | | GSM358441.CEL | GSM437117.CEL |
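
The fingerprint comparison described above can be sketched in base R with `tools::md5sum`. This is only an illustration of the idea under stated assumptions, not DupChecker's code: `sketchFindDuplicates` is a hypothetical helper, the toy files are made up, and the result is a long table (one row per duplicated file) rather than the one-column-per-dataset layout shown above.

```r
# Hypothetical helper; not DupChecker's validateFile.
sketchFindDuplicates <- function(fileTable) {
  # One MD5 fingerprint per file.
  fileTable$md5 <- unname(tools::md5sum(fileTable$filename))
  # Keep every file whose fingerprint occurs more than once.
  counts <- table(fileTable$md5)
  dupMd5 <- names(counts)[counts > 1]
  dups <- fileTable[fileTable$md5 %in% dupMd5, , drop = FALSE]
  dups <- dups[order(dups$md5, dups$dataset), , drop = FALSE]
  list(hasdup = nrow(dups) > 0, duptable = dups)
}

# Toy data: two files with identical content in different dataset folders.
demoRoot <- tempfile("dup_demo")
dir.create(file.path(demoRoot, "setA"), recursive = TRUE)
dir.create(file.path(demoRoot, "setB"), recursive = TRUE)
writeLines("same content", file.path(demoRoot, "setA", "x1.CEL"))
writeLines("same content", file.path(demoRoot, "setB", "y1.CEL"))
writeLines("different content", file.path(demoRoot, "setB", "y2.CEL"))

demoFiles <- list.files(demoRoot, recursive = TRUE, full.names = TRUE)
demoTable <- data.frame(dataset = basename(dirname(demoFiles)),
                        filename = demoFiles, stringsAsFactors = FALSE)
demoResult <- sketchFindDuplicates(demoTable)
demoResult$duptable[, c("md5", "dataset", "filename")]
```

Grouping by fingerprint instead of comparing files pairwise keeps the check to a single pass over the files, which matters when datasets contain hundreds of CEL files.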
If you use DupChecker in published research, please cite:
Quanhu Sheng, Yu Shyr, Xi Chen. DupChecker: a Bioconductor package for checking high-throughput genomic data redundancy in meta-analysis. BMC Bioinformatics 2014, 15:323.