Global normalization methods and their assumptions
Global normalization methods such as quantile normalization have become a standard part of the analysis pipeline for high-throughput data, where they are used to remove unwanted technical variation. These methods, and others that rely solely on observed data without external information (e.g. spike-ins), assume that only a minority of genes are differentially expressed (or that an equivalent number of genes increase and decrease across biological conditions). This assumption can be interpreted in different ways, leading to different global normalization procedures. For example, one procedure assumes that the mean expression level across genes should be the same in every sample. In contrast, quantile normalization assumes that the only difference between the statistical distributions of the samples is technical variation. It normalizes by forcing the observed distributions to be the same, using as the reference the average distribution obtained by averaging each quantile across samples.
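The quantile normalization step described above can be sketched in a few lines of base R. This is only an illustration under simple assumptions (no missing values, ties broken by order of appearance); quantile_normalize and the simulated counts matrix are hypothetical names introduced here and are not part of qsmooth.

```r
# Minimal quantile normalization sketch (base R only).
# quantile_normalize() is a hypothetical helper, not a qsmooth function.
quantile_normalize <- function(x) {
  x <- as.matrix(x)
  # Sort each column: row i now holds the i-th smallest value of every sample
  sorted <- apply(x, 2, sort)
  # Reference distribution: average of each quantile across samples
  ref <- rowMeans(sorted)
  # Replace every value by the reference value at its within-sample rank
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) ref[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated data (assumption): 10 features, 4 samples, two samples globally shifted
set.seed(1)
counts <- matrix(rexp(40, rate = 0.1), nrow = 10, ncol = 4)
counts[, 3:4] <- counts[, 3:4] * 3
qn <- quantile_normalize(counts)
```

After this step every column of qn has exactly the same set of values, which is why the method removes any global difference between samples, whether technical or biological.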
How to evaluate if global normalization methods are appropriate?
While these assumptions may be reasonable in certain experiments, they may not always be
appropriate. Recently, an R/Bioconductor package
has been developed to test for global differences between groups of distributions to evaluate whether
global normalization methods such as quantile normalization should be applied. If global differences
are found between groups of distributions, these changes may be of technical or biological interest.
If these changes are of technical interest (e.g. batch effects), then global normalization methods should be applied.
If these changes are related to a biological factor (e.g. normal/tumor or two tissues), then
global normalization methods should not be applied because the methods will remove the interesting biological variation
(i.e. differentially expressed genes) and artificially induce differences between genes that were not
differentially expressed. When the distributions differ globally
between biological conditions, quantile normalization is not an appropriate normalization method. In
these cases, we can consider a more relaxed assumption about the data, namely that the statistical distribution
of each sample should be the same within biological conditions or groups (compared to the more
stringent assumption of quantile normalization, which states the statistical distribution is the same across all samples).
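The relaxed assumption can be illustrated by applying quantile normalization separately within each biological condition. The sketch below uses base R only; quantile_normalize_within, expr and condition are hypothetical names chosen for this example, and the data are simulated.

```r
# Sketch: quantile normalization applied within each group (base R only).
# quantile_normalize_within() is a hypothetical helper, not a qsmooth function.
quantile_normalize_within <- function(x, group) {
  x <- as.matrix(x)
  group <- as.factor(group)
  stopifnot(length(group) == ncol(x))
  out <- x
  for (g in levels(group)) {
    idx <- which(group == g)
    # Group-specific reference: mean of each quantile over the group's samples
    sorted <- apply(x[, idx, drop = FALSE], 2, sort)
    ref <- rowMeans(sorted)
    for (j in idx) {
      out[, j] <- ref[rank(x[, j], ties.method = "first")]
    }
  }
  out
}

# Simulated data (assumption): 50 features, two conditions with a global shift
set.seed(2)
expr <- cbind(matrix(rlnorm(150, meanlog = 2), nrow = 50, ncol = 3),
              matrix(rlnorm(150, meanlog = 3), nrow = 50, ncol = 3))
condition <- rep(c("normal", "tumor"), each = 3)
within_qn <- quantile_normalize_within(expr, condition)
```

Samples from the same condition end up with identical distributions, while the difference between conditions is preserved.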
qsmooth: a generalization of quantile normalization
Here we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is a weighted average of the two types of assumptions about the data.
The qsmooth R-package contains the
qsmooth() function, which computes a weight at every quantile
that compares the variability between groups relative to within groups. In one extreme, quantile normalization
is applied and in the other extreme quantile normalization within each biological condition is applied.
The weight shrinks the group-level quantile normalized data towards the overall reference quantiles
if variability between groups is sufficiently smaller than the variability within groups. The algorithm is described in the
Figure below (see the
vignettes/qsmooth-vignette.pdf for more details).
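To make the idea of a per-quantile weight concrete, here is a rough base R sketch. It is not the package's implementation: it assumes the weight at each quantile is one minus the ratio of between-group to total sum of squares, smoothed with a running median, and that each sample's reference is the weighted average of the overall and group-level reference quantiles. The names qsmooth_sketch, expr and condition, and the window size k, are hypothetical.

```r
# Rough sketch of a per-quantile weighting scheme (base R only).
# NOT the qsmooth implementation; the weight formula and window k are assumptions.
qsmooth_sketch <- function(x, group, k = 11) {
  x <- as.matrix(x)
  group <- as.factor(group)
  stopifnot(length(group) == ncol(x), k %% 2 == 1, k <= nrow(x))
  q <- apply(x, 2, sort)                      # quantiles (rows) by samples (cols)
  overall <- rowMeans(q)                      # overall reference quantiles
  group_ref <- sapply(levels(group), function(g)
    rowMeans(q[, group == g, drop = FALSE]))  # group-level reference quantiles
  n_g <- as.vector(table(group))
  sst <- rowSums((q - overall)^2)             # total variability per quantile
  ssb <- as.vector((group_ref - overall)^2 %*% n_g)  # between-group variability
  w <- ifelse(sst > 0, 1 - ssb / sst, 1)      # near 1 when groups look alike
  w <- as.numeric(stats::runmed(w, k = k))    # smooth weights across quantiles
  out <- x
  for (j in seq_len(ncol(x))) {
    g <- as.character(group[j])
    ref <- w * overall + (1 - w) * group_ref[, g]
    out[, j] <- ref[rank(x[, j], ties.method = "first")]
  }
  list(data = out, weights = w)
}

# Simulated data (assumption): 200 features, two conditions with a global shift
set.seed(3)
expr <- cbind(matrix(rlnorm(600, meanlog = 2), nrow = 200, ncol = 3),
              matrix(rlnorm(600, meanlog = 3), nrow = 200, ncol = 3))
condition <- rep(c("normal", "tumor"), each = 3)
res <- qsmooth_sketch(expr, condition)
```

In this sketch a weight of 1 reproduces ordinary quantile normalization and a weight of 0 reproduces quantile normalization within each condition, matching the two extremes described above.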
Use install_github() from devtools to install the latest version of qsmooth from the GitHub repository.
After installation, the package can be loaded into R with library(qsmooth).
The main function in the qsmooth package is qsmooth(). The
qsmooth() function needs two objects:
(1) a data frame or matrix with observations (e.g. probes or genes) on the rows and samples as the columns
(e.g. let's call it
eset) and (2) a group level factor called
groupFactor (let's call it
outcome). The order of this factor variable must match the order of the columns in the
eset object because it contains
information about which group each sample is from.
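As an illustration of preparing these two inputs, the following base R sketch builds a simulated eset matrix and aligns a group factor to its columns. The sample_info table, its column names and the simulated values are assumptions made for this example only.

```r
# Sketch: build the two inputs and make sure their order agrees (base R only).
# All data here are simulated; sample_info is a hypothetical annotation table.
set.seed(42)
eset <- matrix(rlnorm(600, meanlog = 2), nrow = 100, ncol = 6,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

sample_info <- data.frame(
  sample = paste0("sample", 1:6),
  condition = rep(c("normal", "tumor"), each = 3),
  stringsAsFactors = FALSE
)
# Annotation tables are often in a different order than the data columns
sample_info <- sample_info[sample(nrow(sample_info)), ]

# Reorder the annotation so row i describes column i of eset
sample_info <- sample_info[match(colnames(eset), sample_info$sample), ]
outcome <- factor(sample_info$condition)

# Guard against silent misalignment before normalizing
stopifnot(length(outcome) == ncol(eset), !anyNA(outcome),
          identical(sample_info$sample, colnames(eset)))
```

Matching on sample names rather than relying on position avoids assigning samples to the wrong group when the annotation and the data were exported in different orders.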
To run the qsmooth() function:
qs <- qsmooth(object = eset, groupFactor = outcome)
Individual slots can be extracted using accessor methods:
qsmoothData(qs) # extract smoothed quantile normalized data
qsmoothWeights(qs) # extract smoothed quantile normalized weights
The weights can be directly plotted using the qsmoothPlotWeights() function:
qsmoothPlotWeights(qs) # plot weights
See vignettes/qsmooth-vignette.pdf for more details.
Report bugs as issues on the GitHub repository.