MS data can be stored with the MsBackendSql in any SQL database system supported by R/DBI (i.e. for which a dedicated R package is available). Here I compare the performance of accessing MS data stored in either a SQLite or a MariaDB database. Some properties:
Both databases (SQLite and MariaDB) are stored on the same hard disk/partition (internal NVMe disk, thus high data I/O is expected).
LC-MS data from 8,804 samples (mzML files) was stored to the databases: in total 15,151,673 spectra.
Size of the SQLite database: 825GB
Size of the MariaDB database: 836GB
MariaDB database uses the Aria storage engine.
mse_maria and mse_sqlite below are two MsExperiment objects with the MS data represented by a MsBackendOfflineSQL backend.
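SQLite is about 10 seconds faster at extracting MS levels for all spectra. uniqueMsLevels uses a select distinct... call to extract unique MS levels; here MariaDB is by far faster. Filtering by retention time shows about the same performance for both. filterRt uses SQL-based filtering on the "rtime" spectra variable, i.e. it performs the filtering within the database.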
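To make "filtering within the database" concrete, here is a minimal sketch using DBI with an in-memory SQLite database. The table name `msms_spectrum` and the columns `spectrum_id_`, `msLevel` and `rtime` are assumptions chosen for illustration, and the queries are not necessarily the ones MsBackendSql issues. The sketch shows a retention time range evaluated by the database, so that only matching primary keys are returned to R, plus a `select distinct` query for unique MS levels.

```r
library(DBI)

# Hypothetical, simplified schema: one row per spectrum. Table and column
# names are assumptions for illustration only.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
spectra_tbl <- data.frame(
    spectrum_id_ = 1:6,
    msLevel = c(1L, 2L, 1L, 2L, 1L, 1L),
    rtime = c(150.2, 210.5, 250.0, 299.9, 300.0, 410.7)
)
dbWriteTable(con, "msms_spectrum", spectra_tbl)

# Retention time filter evaluated inside the database: only the primary
# keys of matching spectra are transferred to R.
rt <- c(200, 300)
in_range <- dbGetQuery(
    con,
    "SELECT spectrum_id_ FROM msms_spectrum WHERE rtime >= ? AND rtime <= ?",
    params = list(rt[1], rt[2])
)

# Unique MS levels through a 'select distinct' query.
ms_levels <- dbGetQuery(
    con, "SELECT DISTINCT msLevel FROM msms_spectrum ORDER BY msLevel")

dbDisconnect(con)
```

Because the range condition is evaluated by the database engine, the cost of this step depends on how the engine plans the query, which is where differences between SQLite and MariaDB can show up.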
Next we subset the data to spectra from 10 random samples and evaluate also access to this data subset. Note that in general, for data analysis, the MS data will be processed per sample.
#' Access data from random 10 samples.
set.seed(123)
idx <- sample(seq_along(mse_maria), 10)
mse_maria_sub <- mse_maria[idx]
mse_sqlite_sub <- mse_sqlite[idx]
microbenchmark(msLevel(spectra(mse_maria_sub)),
               msLevel(spectra(mse_sqlite_sub)),
               times = 7)
## Unit: milliseconds
##                              expr      min       lq     mean   median       uq      max neval cld
##   msLevel(spectra(mse_maria_sub)) 45.60593 45.79210 46.42101 46.29717 46.92385 47.61207     7   a
##  msLevel(spectra(mse_sqlite_sub)) 24.49343 25.12833 25.23884 25.17525 25.47306 25.80040     7   b
Again, accessing a single spectra variable is faster with SQLite.
#' Filtering by retention time in the data subset
microbenchmark(filterRt(spectra(mse_maria_sub), rt = c(200, 300)),
               filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)),
               times = 7)
## Unit: milliseconds
##                                                 expr        min        lq       mean     median         uq        max neval cld
##   filterRt(spectra(mse_maria_sub), rt = c(200, 300))   37.27201   38.1075   39.84782   39.81546   41.74291   42.14641     7   a
##  filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)) 2320.30354 2324.0109 2326.98962 2325.46668 2328.67538 2337.78457     7   b
Filtering by retention time within the data subset is much faster using the MariaDB database.
Performance of accessing peaks data from the data subsets is about the same. Finally, we compare the performance of a frequently used task in LC-MS data analysis (with the xcms package): extracting the MS data in chromatographic representation. Below we use chromatogram to extract base peak chromatograms of the MS data per sample. Performance is comparable. Lastly, we also combine this extraction with a filter for retention times.
Here the MariaDB database clearly outperforms the SQLite database. The SQL query used here combines both the primary keys of the spectra for the data subset and the retention times of these spectra.
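The kind of query described here, one that restricts by primary key and by retention time at the same time, can be sketched as follows. This is only an illustration under assumptions: the table `msms_spectrum`, the columns `spectrum_id_`, `sample_idx` and `rtime`, and the exact form of the SQL are hypothetical and not taken from the MsBackendSql source.

```r
library(DBI)

# Hypothetical toy table: 10 samples with 100 spectra each.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
n <- 1000
spectra_tbl <- data.frame(
    spectrum_id_ = seq_len(n),
    sample_idx = rep(1:10, each = n / 10),
    rtime = rep(seq(1, 500, length.out = n / 10), 10)
)
dbWriteTable(con, "msms_spectrum", spectra_tbl)

# Primary keys of the spectra belonging to the selected samples
# (the "data subset").
keys <- spectra_tbl$spectrum_id_[spectra_tbl$sample_idx %in% c(2, 7)]

# One query combining the key restriction with the retention time range.
qry <- paste0(
    "SELECT spectrum_id_, rtime FROM msms_spectrum ",
    "WHERE spectrum_id_ IN (", paste(keys, collapse = ","), ") ",
    "AND rtime >= 200 AND rtime <= 300")
res <- dbGetQuery(con, qry)

dbDisconnect(con)
```

With two conditions in one `WHERE` clause, the database engine has to decide which one to evaluate first and whether an index can be used, so run times for such queries can differ a lot between engines even when simpler queries perform similarly.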
Summary
For most operations both SQLite and MariaDB database engines are about equally performant.
For data access involving more complex queries (i.e. queries that combine retention time values and primary keys, such as filtering spectra within a subset of samples from the full data set), MariaDB has clear advantages, while for plain access to individual spectra variables SQLite is faster.