Fabien GIRKA, Etienne CAMENEN, Caroline PELTIER, Vincent GUILLEMOT, Arnaud GLOAGUEN, Laurent LE BRUSQUET, Arthur TENENHAUS
Regularized Generalized Canonical Correlation Analysis, multi-block data analysis
arthur.tenenhaus@centralesupelec.fr
Performs multiblock component methods (PCA, CCA, PLS, MCOA, GCCA, CPCA, MAXVAR, R/SGCCA, etc.) and produces graphical outputs (e.g. variables and individuals plots) and statistics to assess the robustness/significance of the analysis.
- Description
- Algorithm
- Installation
- Installation of a development branch from the git repository
- References
A package for multiblock data analysis (RGCCA - Regularized Generalized Canonical Correlation Analysis) as described in [1-4]. The software produces graphical outputs and statistics to assess the robustness/significance of the analysis.
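We consider $J$ data blocks $\mathbf X_1, \ldots, \mathbf X_J$, each an $n \times p_j$ matrix of $p_j$ variables observed on the same $n$ individuals. For each block, RGCCA seeks a block weight vector $\mathbf a_j$ that defines the block component $\mathbf X_j \mathbf a_j$.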
RGCCA subsumes fifty years of multiblock component methods and is defined as the following optimization problem:
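Written with the quantities described in the list that follows (scheme function $g$, design matrix $\mathbf C = (c_{jk})$, shrinkage parameters $\tau_j$), and with $\mathbf X_j$ denoting block $j$ and $\mathbf a_j$ its weight vector, the problem takes the form below. This display is reconstructed from those definitions and from the constraint stated for $\tau_j$, so the exact form should be checked against [1] and [4].

$$
\max_{\mathbf a_1, \ldots, \mathbf a_J} \sum_{j, k = 1}^{J} c_{jk} \, g\big(\text{cov}(\mathbf X_j \mathbf a_j, \mathbf X_k \mathbf a_k)\big)
\quad \text{s.t.} \quad
(1 - \tau_j) \text{var}(\mathbf X_j \mathbf a_j) + \tau_j \Vert \mathbf a_j \Vert^2 = 1, \quad j = 1, \ldots, J.
$$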
- The scheme function $g$ is any continuous convex function and allows different optimization criteria to be considered. Typical choices of $g$ are the identity (horst scheme, leading to maximizing the sum of covariances between block components), the absolute value (centroid scheme, yielding maximization of the sum of the absolute values of the covariances), the square function (factorial scheme, thereby maximizing the sum of squared covariances), or, more generally, for any even integer $m$, $g(x) = x^m$ ($m$-scheme, maximizing the sum of the $m$-th powers of the covariances). The horst scheme penalizes structural negative correlation between block components, while both the centroid scheme and the $m$-scheme enable two components to be negatively correlated. According to [5], a fair model is a model where all blocks contribute equally to the solution, in opposition to a model dominated by only a few of the $J$ sets. If fairness is a major objective, the user should choose $m = 1$; $m > 1$ is preferable if the user wants to discriminate between blocks. In practice, $m$ is equal to 1, 2 or 4. The higher the value of $m$, the more the method acts as a block selector [5].
- The design matrix $\mathbf C$ is a symmetric $J \times J$ matrix of nonnegative elements describing the network of connections between blocks that the user wants to take into account. Usually, $c_{jk} = 1$ for two connected blocks and 0 otherwise.
- The $\tau_j$ are called shrinkage parameters or regularization parameters and range from 0 to 1. $\tau_j$ interpolates smoothly between maximizing the covariance and maximizing the correlation. Setting $\tau_j$ to 0 forces the block components to unit variance ($\text{var}(\mathbf X_j \mathbf a_j) = 1$), in which case the covariance criterion boils down to the correlation. The correlation criterion is better at explaining the correlated structure across datasets, but discards the variance within each individual dataset. Setting $\tau_j$ to 1 normalizes the block weight vectors ($\Vert \mathbf a_j \Vert = 1$), which applies the covariance criterion. A value between 0 and 1 leads to a compromise between the first two options and corresponds to the constraint $(1 - \tau_j) \text{var}(\mathbf X_j \mathbf a_j) + \tau_j \Vert \mathbf a_j \Vert^2 = 1$. In the RGCCA package, for each block, the determination of the shrinkage parameter can be made fully automatic by using the analytical formula proposed by Schäfer and Strimmer (2005) [6], by permutation, or by K-fold cross-validation. Moreover, the choice of the shrinkage parameters can be guided by the properties of the resulting block components:
  - $\tau_j = 1$ yields the maximization of a covariance-based criterion. It is recommended when the user wants a stable component (large variance) while simultaneously taking into account the correlations between blocks. The user must, however, be aware that variance dominates over correlation.
  - $\tau_j = 0$ yields the maximization of a correlation-based criterion. It is recommended when the user wants to maximize correlations between connected components. This option can yield unstable solutions in case of multi-collinearity and cannot be used when a data block is rank deficient (e.g. $n < p_j$).
  - $0 < \tau_j < 1$ is a good compromise between variance and correlation: the block components are simultaneously stable and as well correlated as possible with their connected block components. This setting can be used when the data block is rank deficient.
-
The quality and interpretability of the RGCCA block components
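The roles of the scheme function $g$, the design matrix $\mathbf C$ and the shrinkage parameters $\tau_j$ described above can be made concrete with a small base-R sketch. It does not call the RGCCA package and does not optimize anything: it only rescales fixed, arbitrarily chosen weight vectors so that the $\tau_j$ constraint holds, then evaluates the criterion under three schemes. The simulated blocks and all function names (`constraint_value`, `normalize_weights`, `rgcca_criterion`) are hypothetical, and the sample variance with denominator $n - 1$ is used as a simplification.

```r
set.seed(1)
n <- 50

# Two hypothetical centered blocks sharing a latent signal z
z <- rnorm(n)
X1 <- scale(cbind(z + rnorm(n), rnorm(n), rnorm(n)), scale = FALSE)
X2 <- scale(cbind(z + rnorm(n), rnorm(n)), scale = FALSE)
blocks <- list(X1, X2)

# Scheme functions g named as in the text
schemes <- list(
  horst = function(x) x,
  centroid = function(x) abs(x),
  factorial = function(x) x^2
)

# Design matrix: symmetric, c_jk = 1 for connected blocks, 0 otherwise
C <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)

# Left-hand side of the constraint: (1 - tau) var(X a) + tau ||a||^2
constraint_value <- function(X, a, tau) {
  (1 - tau) * drop(var(X %*% a)) + tau * sum(a^2)
}

# The constraint is quadratic in a, so dividing by its square root
# rescales a to make the constraint equal to 1
normalize_weights <- function(X, a, tau) {
  a / sqrt(constraint_value(X, a, tau))
}

# Criterion: sum over j, k of c_jk * g(cov(X_j a_j, X_k a_k))
rgcca_criterion <- function(blocks, weights, C, g) {
  Y <- sapply(seq_along(blocks), function(j) drop(blocks[[j]] %*% weights[[j]]))
  sum(C * g(cov(Y)))
}

# Arbitrary (not optimal) weights, rescaled for tau_1 = 1 and tau_2 = 0.5
tau <- c(1, 0.5)
weights <- list(
  normalize_weights(X1, c(1, 0, 0), tau[1]),
  normalize_weights(X2, c(1, 0), tau[2])
)

criteria <- sapply(schemes, function(g) rgcca_criterion(blocks, weights, C, g))
print(criteria)
```

With $\tau_j = 1$ the rescaling reduces to dividing by $\Vert \mathbf a_j \Vert$, and with $\tau_j = 0$ to dividing by the standard deviation of the block component, which matches the two extreme cases described in the list.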
Required:
- Software: R (≥ 3.2.0)
- R libraries: see the DESCRIPTION file.
install.packages("RGCCA")
See the vignette for an introduction to the package.
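Once the package is installed, a first analysis might look like the sketch below. It assumes the interface of RGCCA version 3 (the `rgcca()` arguments `blocks`, `connection`, `tau` and `scheme`, and the bundled `Russett` dataset). The split of the variables into blocks and the design matrix are illustrative choices, and the vignette remains the authoritative reference.

```r
# Run only if RGCCA is installed; version >= 3.0.0 is an assumption of this sketch
rgcca_available <- requireNamespace("RGCCA", quietly = TRUE) &&
  utils::packageVersion("RGCCA") >= "3.0.0"

if (rgcca_available) {
  data("Russett", package = "RGCCA")

  # Illustrative split of the Russett variables into three blocks
  blocks <- list(
    agriculture = Russett[, 1:3],
    industry = Russett[, 4:5],
    politic = Russett[, 6:8]
  )

  # Design matrix: agriculture and industry are each connected to politic
  connection <- matrix(
    c(0, 0, 1,
      0, 0, 1,
      1, 1, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(names(blocks), names(blocks))
  )

  # tau = 1 (covariance criterion) with the factorial scheme
  fit <- RGCCA::rgcca(blocks, connection = connection, tau = 1, scheme = "factorial")

  # One matrix of block components per block
  str(fit$Y)
}
```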
Required:
- Software: R (≥ 3.2.0)
- R libraries: see the DESCRIPTION file.
- The R library devtools.
remove.packages("RGCCA")
devtools::install_github(repo="https://github.com/rgcca-factory/RGCCA.git", ref = "main")
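The two commands above can be wrapped in a slightly more defensive helper, sketched below. The function name `install_rgcca_dev` is hypothetical. It checks that devtools is available and removes RGCCA only if it is currently installed, so that a missing package does not interrupt the script. Defining the function does not install anything; it has to be called explicitly.

```r
install_rgcca_dev <- function(ref = "main") {
  # devtools is listed above as a requirement for this installation route
  if (!requireNamespace("devtools", quietly = TRUE)) {
    stop("devtools is required: install.packages(\"devtools\")")
  }

  # Remove an existing installation only if there is one
  if ("RGCCA" %in% rownames(utils::installed.packages())) {
    utils::remove.packages("RGCCA")
  }

  # Same repository and branch as in the commands above
  devtools::install_github(
    repo = "https://github.com/rgcca-factory/RGCCA.git",
    ref = ref
  )
}

# Usage (performs the installation, requires network access):
# install_rgcca_dev()
```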
1. Tenenhaus, M., Tenenhaus, A., & Groenen, P. J. (2017). Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika, 82(3), 737-777.
2. Tenenhaus, A., Philippe, C., & Frouin, V. (2015). Kernel generalized canonical correlation analysis. Computational Statistics & Data Analysis, 90, 114-131.
3. Tenenhaus, A., Philippe, C., Guillemot, V., Le Cao, K. A., Grill, J., & Frouin, V. (2014). Variable selection for generalized canonical correlation analysis. Biostatistics, 15(3), 569-583.
4. Tenenhaus, A., & Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2), 257.
5. Van de Geer, J. P. (1984). Linear relations among K sets of variables. Psychometrika, 49(1), 79-94.
6. Schäfer, J., & Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1).
7. Tenenhaus, A., & Tenenhaus, M. (2014). Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238(2), 391-403.