# PCA: Basic Summary

Principal component analysis (PCA) is used to extract the important information from a multivariate data table and to express this information as a small set of new variables called principal components. Each new variable is a linear combination of the original variables. The number of principal components is less than or equal to the number of original variables.

The information in a given data set corresponds to the total variation it contains. The goal of PCA is to identify directions (or principal components) along which the variation in the data is maximal.

In other words, PCA reduces the dimensionality of multivariate data, often to two or three principal components that can be visualized graphically, with minimal loss of information.

PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).

The amount of variance retained by each principal component is measured by the so-called eigenvalue.
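
As a minimal sketch of what these eigenvalues are, the base R snippet below compares the eigenvalues of a correlation matrix with the component variances returned by `prcomp()`. It uses the built-in `USArrests` data as a stand-in (an assumption for illustration, not the decathlon data analyzed later).

```r
# Stand-in data: the built-in USArrests data set (4 numeric variables)
X <- USArrests

# Eigenvalues of the correlation matrix, sorted in decreasing order
eig <- eigen(cor(X))$values

# PCA on centred and scaled data; sdev^2 is the variance of each component
pc <- prcomp(X, center = TRUE, scale. = TRUE)
pc_var <- pc$sdev^2

print(round(eig, 4))
print(round(pc_var, 4))

# With standardized variables the eigenvalues sum to the number of variables
print(sum(eig))
```

The two printed vectors should agree up to floating-point error, which is why each eigenvalue can be read as the variance retained by one component.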

The PCA method is particularly useful when the variables within the data set are highly correlated.

Correlation indicates that there is redundancy in the data. 

Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables (the principal components) that explain most of the variance in the original variables.

Taken together, the main purpose of principal component analysis is to:

* identify hidden patterns in a data set,

* reduce the dimensionality of the data by removing noise and redundancy,

* identify correlated variables.

In [4]:
install.packages('FactoMineR')
install.packages('factoextra')
library(FactoMineR)
library(factoextra)



**Below we analyze decathlon athletes and their corresponding features with PCA**

In [5]:
decathlon <- read.csv('../input/decathlon/decathlon.csv')

head(decathlon)

| | X | X100m | Long.jump | Shot.put | High.jump | X400m | X110m.hurdle | Discus | Pole.vault | Javeline | X1500m | Rank | Points | Competition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<chr>` |
| 1 | SEBRLE | 11.04 | 7.58 | 14.83 | 2.07 | 49.81 | 14.69 | 43.75 | 5.02 | 63.19 | 291.7 | 1 | 8217 | Decastar |
| 2 | CLAY | 10.76 | 7.4 | 14.26 | 1.86 | 49.37 | 14.05 | 50.72 | 4.92 | 60.15 | 301.5 | 2 | 8122 | Decastar |
| 3 | KARPOV | 11.02 | 7.3 | 14.77 | 2.04 | 48.37 | 14.09 | 48.95 | 4.92 | 50.31 | 300.2 | 3 | 8099 | Decastar |
| 4 | BERNARD | 11.02 | 7.23 | 14.25 | 1.92 | 48.93 | 14.99 | 40.87 | 5.32 | 62.77 | 280.1 | 4 | 8067 | Decastar |
| 5 | YURKOV | 11.34 | 7.09 | 15.19 | 2.1 | 50.42 | 15.31 | 46.26 | 4.72 | 63.44 | 276.4 | 5 | 8036 | Decastar |
| 6 | WARNERS | 11.11 | 7.6 | 14.31 | 1.98 | 48.68 | 14.23 | 41.1 | 4.92 | 51.77 | 278.1 | 6 | 8030 | Decastar |


In [7]:
library(tidyverse)
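# Drop the athlete name (X) and the competition label; Rank and Points are kept,
# so 12 numeric variables enter the PCA below
dec <- decathlon %>% select(-c(X, Competition))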

# Data Standardization

In principal component analysis, variables are often scaled (i.e., standardized). This is particularly recommended when variables are measured on different scales (e.g., kilograms, kilometers, centimeters); otherwise, variables with large numeric ranges dominate the PCA output.

The goal is to make the variables comparable. Generally, variables are scaled to have

i) standard deviation one

ii) mean zero.

The PCA() function in FactoMineR standardizes the variables by default (scale.unit = TRUE), so no separate scaling step is needed in the call below.
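
For illustration only, here is a small base R sketch of this standardization using `scale()`. The built-in `USArrests` data is an assumed stand-in, chosen because its columns are on very different scales.

```r
# Stand-in data with columns on different scales
X <- USArrests

# Centre each column to mean zero and scale it to standard deviation one
Z <- scale(X, center = TRUE, scale = TRUE)

col_means <- colMeans(Z)      # expected to be numerically zero
col_sds <- apply(Z, 2, sd)    # expected to be one

print(round(col_means, 10))
print(col_sds)
```

After this step every variable contributes one unit of variance, so no single variable dominates the components purely because of its unit of measurement.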

In [8]:
res_pca <- PCA(dec,graph = FALSE)

print(res_pca)

**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 41 individuals, described by 12 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error o

In [11]:
summary(res_pca)


Call:
PCA(X = dec, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
Variance               4.759   1.740   1.415   1.132   0.862   0.607   0.510
% of var.             39.657  14.501  11.791   9.431   7.183   5.061   4.254
Cumulative % of var.  39.657  54.158  65.949  75.380  82.563  87.624  91.878
                       Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
Variance               0.411   0.235   0.187   0.141   0.000
% of var.              3.426   1.960   1.561   1.175   0.000
Cumulative % of var.  95.303  97.264  98.825 100.000 100.000

Individuals (the 10 first)
                 Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
1            |  2.833 |  1.505  1.161  0.282 |  0.704  0.694  0.062 |  0.942
2            |  3.754 |  1.557  1.243  0.172 |  0.555  0.432  0.022 |  2.189
3            |  3.602 |  1.600  1.312  0.197 |  0.463  0.300  0.016 |  2.057
4            |  2.957 |  0.082  0.003  0.001 | -0.978  1.340  

**Eigenvalues / Variances**

The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

# Eigenvalues

In [12]:
eig_val <- get_eigenvalue(res_pca)

eig_val

| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.75879 | 39.65659 | 39.65659 |
| Dim.2 | 1.740146 | 14.50122 | 54.1578 |
| Dim.3 | 1.414902 | 11.79085 | 65.94866 |
| Dim.4 | 1.131778 | 9.431483 | 75.38014 |
| Dim.5 | 0.8619423 | 7.182852 | 82.56299 |
| Dim.6 | 0.6073189 | 5.060991 | 87.62398 |
| Dim.7 | 0.5104506 | 4.253755 | 91.87774 |
| Dim.8 | 0.4110845 | 3.425704 | 95.30344 |
| Dim.9 | 0.2352087 | 1.960072 | 97.26351 |
| Dim.10 | 0.1873636 | 1.561364 | 98.82488 |


The proportion of variation explained by each eigenvalue is given in the second column (variance.percent). For example, 4.759 divided by 12 (the total variance of the 12 standardized variables) equals 0.3966, so about 39.66% of the variation is explained by the first eigenvalue.

The cumulative percentage explained is the running total of these proportions. For instance, 39.66% plus 14.50% equals 54.16%, so about 54.16% of the variation is explained by the first two eigenvalues together.

An eigenvalue > 1 indicates that the PC accounts for more variance than any single original variable in standardized data. This is commonly used as a cutoff for deciding which PCs to retain, and it holds only when the data are standardized.

You can also limit the number of components to the number that accounts for a chosen fraction of the total variance. For example, if you are satisfied with 70% of the total variance explained, use the smallest number of components that reaches it.

In our analysis, the first four principal components explain about 75% of the variation. This is an acceptably large percentage.
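
The arithmetic behind these rules can be sketched in a few lines of base R. The eigenvalues below are copied by hand from the `summary(res_pca)` output above (rounded to three decimals); the variable names are illustrative.

```r
# Eigenvalues as printed by summary(res_pca), rounded to 3 decimals
eig <- c(4.759, 1.740, 1.415, 1.132, 0.862, 0.607,
         0.510, 0.411, 0.235, 0.187, 0.141, 0.000)

# With standardized data the total variance equals the number of variables
n_vars <- 12

pct <- 100 * eig / n_vars   # percentage of variance per component
cum_pct <- cumsum(pct)      # running total

# Rule 1: keep components with eigenvalue > 1
n_kaiser <- sum(eig > 1)

# Rule 2: keep the smallest number of components reaching 70% of variance
n_70 <- which(cum_pct >= 70)[1]

print(round(cbind(eigenvalue = eig, pct = pct, cum_pct = cum_pct), 2))
print(c(n_kaiser = n_kaiser, n_70 = n_70))
```

On these values both rules select the first four components.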

We can also create a scree plot

In [None]:
fviz_eig( res_pca, addlabels = TRUE, ylim = c( 0, 50)) 

![Rplot001.png](attachment:b64a8d71-880d-4d4b-b93d-cd2bacaff48f.png)

From the plot above, we might want to stop at the sixth principal component: 87.6% of the information (variance) contained in the data is retained by the first six principal components.

A simple method to extract the results for variables from a PCA output is to use the function get_pca_var() [factoextra package].

This function provides a list of matrices containing all the results for the active variables (coordinates, correlations between variables and axes, squared cosines and contributions).

In [9]:
var <- get_pca_var(res_pca)
var

Principal Component Analysis Results for variables
  Name       Description                                    
1 "$coord"   "Coordinates for the variables"                
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"                       
4 "$contrib" "contributions of the variables"               

**The correlation between a variable and a principal component is used as the coordinate of the variable on the PC.**

In [16]:
head(var$coord,4)

| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| X100m | -0.7081629 | 0.1576287 | -0.15438506 | 0.206049535 | 0.5140831 |
| Long.jump | 0.7559094 | -0.3329181 | 0.18223996 | 0.007219957 | -0.04747772 |
| Shot.put | 0.611649 | 0.6122609 | -0.01992738 | 0.109366452 | -0.07290452 |
| High.jump | 0.5878896 | 0.359724 | -0.23912699 | -0.08658869 | 0.4030039 |
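
To see why a variable's coordinate can be read as a correlation, the following base R sketch (again on the stand-in `USArrests` data, with `prcomp()` rather than FactoMineR) rescales the loadings by the component standard deviations and compares the result with the correlations between the variables and the component scores.

```r
# Stand-in data and a standardized PCA with base R
X <- USArrests
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Variable coordinates: each loading column multiplied by that component's sd
coord <- sweep(pc$rotation, 2, pc$sdev, `*`)

# Correlations between the original variables and the component scores
cors <- cor(X, pc$x)

print(round(coord, 3))
print(round(cors, 3))
```

The two matrices match for a standardized PCA. Component signs are arbitrary, so they can differ from those reported by another implementation.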


To plot the variables on the first two dimensions:

In [None]:
fviz_pca_var( res_pca, col.var = "black")

![Rplot001 (1).png](attachment:a59b46ac-eb93-402b-9292-8be117f10bc2.png)

The plot above is a **variable correlation plot**. It shows the relationships between all variables and can be interpreted as follows:

* Positively correlated variables are grouped together.

* Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).

* The distance between variables and the origin measures the quality of the variables on the factor map.

* Variables that are away from the origin are well represented on the factor map.

# Quality of Representation

**The quality of representation of the variables on the factor map is called cos2** (squared cosine, squared coordinates)

In [18]:
head(var$cos2,4)

| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| X100m | 0.5014947 | 0.02484679 | 0.0238347454 | 0.04245641 | 0.264281435 |
| Long.jump | 0.571399 | 0.11083443 | 0.0332114038 | 5.212778e-05 | 0.002254133 |
| Shot.put | 0.3741145 | 0.37486336 | 0.0003971003 | 0.01196102 | 0.005315068 |
| High.jump | 0.3456141 | 0.12940137 | 0.0571817192 | 0.007497601 | 0.162412145 |


In [11]:
library(corrplot)
library(GGally)

corrplot 0.92 loaded

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



**Correlation Plot**

In [None]:
corrplot(var$cos2,is.corr=FALSE)

![Rplot001 (2).png](attachment:cfc3054d-c646-4a19-ad62-1eb1ebd0d680.png)

In [None]:
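# Note: ggcorr() computes correlations between the columns of its input, so this
# shows how the Dim.* columns of cos2 correlate, not the cos2 values themselves
ggcorr(var$cos2, palette = "RdBu", label = TRUE)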

![Rplot001 (3).png](attachment:ed1b1fc3-ff11-402c-b0fb-e92797675b71.png)

**Bar plot visualization**

In [None]:
#Total cos2 of variables on Dim. 1 and Dim. 2
fviz_cos2( res_pca, choice = "var", axes = 1: 2)

![Rplot001 (4).png](attachment:f4d7d448-5cb5-406a-9002-215ee335b057.png)

A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.

A low cos2 indicates that the variable is not well represented by the principal components. In this case the variable is close to the center of the circle.

It’s possible to color variables by their cos2 values using the argument col.var = "cos2". This produces a color gradient. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c("white", "blue", "red") means that:

* variables with low cos2 values will be colored in white

* variables with mid cos2 values will be colored in blue

* variables with high cos2 values will be colored in red

For a given variable, the sum of the cos2 on all the principal components is equal to one.

If a variable is perfectly represented by only two principal components (Dim.1 & Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variables will be positioned on the circle of correlations.

For some of the variables, more than 2 components might be required to perfectly represent the data. In this case the variables are positioned inside the circle of correlations.
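
The link between coordinates, cos2 and position inside the correlation circle can be checked by hand. The sketch below squares the Dim.1 and Dim.2 coordinates printed by `head(var$coord, 4)` earlier; the values are copied manually, so they are rounded.

```r
# Dim.1 and Dim.2 coordinates copied from the head(var$coord, 4) output
coord <- matrix(
  c(-0.7081629,  0.1576287,
     0.7559094, -0.3329181,
     0.6116490,  0.6122609,
     0.5878896,  0.3597240),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("X100m", "Long.jump", "Shot.put", "High.jump"),
                  c("Dim.1", "Dim.2"))
)

# cos2 is the squared coordinate on each component
cos2 <- coord^2

# Quality of representation on the Dim.1-Dim.2 plane
cos2_12 <- rowSums(cos2)

# Distance of each variable's arrow tip from the origin of the circle
dist_origin <- sqrt(cos2_12)

print(round(cbind(cos2, cos2_12, dist_origin), 3))
```

All four sums are below one, so these variables sit inside the unit circle; Shot.put, with a sum near 0.75, lies closest to the circumference among them.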

In [None]:
# Color by cos2 values: quality on the factor map 
fviz_pca_var(res_pca, col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping 
             )

![Rplot001 (5).png](attachment:85071df2-22e7-44a4-9a40-95733eeec48d.png)

With the palette used in the code above:

* variables with low cos2 values are colored in teal ("#00AFBB")

* variables with mid cos2 values are colored in yellow ("#E7B800")

* variables with high cos2 values are colored in orange-red ("#FC4E07")

# Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.

Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.

Variables that are not correlated with any PC, or that are correlated only with the last dimensions, have low contributions and might be removed to simplify the overall analysis.

The contributions of variables can be extracted as follows:

In [22]:
head(var$contrib,10)

| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 |
|---|---|---|---|---|---|
| X100m | 10.5382802 | 1.427857 | 1.68455054 | 3.75130206 | 30.6611523 |
| Long.jump | 12.0072309 | 6.36926 | 2.34725764 | 0.004605831 | 0.2615179 |
| Shot.put | 7.8615458 | 21.542062 | 0.02806556 | 1.056834544 | 0.6166385 |
| High.jump | 7.2626467 | 7.436236 | 4.04138977 | 0.662462192 | 18.8425779 |
| X400m | 10.1582977 | 17.675417 | 1.22207996 | 0.695168032 | 0.6447135 |
| X110m.hurdle | 10.3979678 | 2.415117 | 0.4642881 | 14.767777108 | 1.4438507 |
| Discus | 5.8889946 | 21.99468 | 0.2109777 | 6.593926298 | 1.0997592 |
| Pole.vault | 0.4408398 | 2.336074 | 35.050344 | 30.656386097 | 1.2009404 |
| Javeline | 2.1378946 | 5.955215 | 10.77628204 | 31.667171084 | 21.5978461 |
| X1500m | 0.2312995 | 12.647165 | 43.19875598 | 3.209553198 | 1.5257759 |
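
These percentages can be reproduced from the earlier outputs: the contribution of a variable to a component is its cos2 on that component divided by the component's eigenvalue, times 100. A short check in base R, with values copied by hand from the `var$cos2` and `eig_val` outputs above:

```r
# cos2 on Dim.1 for the first four variables (from head(var$cos2, 4))
cos2_dim1 <- c(X100m = 0.5014947, Long.jump = 0.5713990,
               Shot.put = 0.3741145, High.jump = 0.3456141)

# Eigenvalue of Dim.1 (from get_eigenvalue(res_pca))
eig1 <- 4.75879

# Contribution (in %) of each variable to Dim.1
contrib_dim1 <- 100 * cos2_dim1 / eig1

print(round(contrib_dim1, 4))
```

The results match the Dim.1 column of the table above up to rounding of the copied inputs.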


In [None]:
corrplot(var$contrib, is.corr=FALSE)   

![Rplot001 (6).png](attachment:3f849431-fb1f-4156-bb1e-d9cde1fd683f.png)

**Plot shows the top 10 variables contributing to the principal components:**

In [None]:
# Contributions of variables to PC1
fviz_contrib(res_pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res_pca, choice = "var", axes = 2, top = 10)

![Rplot001 (7).png](attachment:01e76b8d-68d6-4cd9-a3b8-8aa356a6437b.png)

![Rplot002.png](attachment:9a8dd04c-376e-4f74-be27-937b39cdfcaf.png)

In [None]:
fviz_contrib(res_pca, choice = "var", axes = 1:2, top = 10)

![Rplot001 (12).png](attachment:31d6b31c-14de-4dae-95b5-f08ca7a24492.png)

The red dashed line on the graph above indicates the expected average contribution. If the contributions of the variables were uniform, the expected value would be 1/length(variables). This PCA was run on 12 active variables (the 10 events plus Rank and Points), so the expected value is 1/12 ≈ 8.33%. For a given component, a variable with a contribution larger than this cutoff can be considered important in contributing to the component.

The total contribution of a given variable to the variation retained by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1) + (C2 * Eig2)]/(Eig1 + Eig2), where

* C1 and C2 are the contributions of the variable on PC1 and PC2, respectively

* Eig1 and Eig2 are the eigenvalues of PC1 and PC2, respectively. Eigenvalues measure the amount of variation retained by each PC.

In this case, the expected average contribution (cutoff) is calculated as follows:

As mentioned above, if the contributions of the 12 variables were uniform, the expected average contribution on a given PC would be 1/12 ≈ 8.33%. The expected average contribution of a variable for PC1 and PC2 is then [(8.33 * Eig1) + (8.33 * Eig2)]/(Eig1 + Eig2) = 8.33%.

Among the ten events, X400m, Shot.put, Long.jump and Discus contribute the most to dimensions 1 and 2.
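
As a worked example of this weighted average, the sketch below applies the formula to two variables, using contributions and eigenvalues copied from the outputs above. It assumes the 12 active variables reported by `print(res_pca)`, so the uniform contribution used for the cutoff is 100/12; the object names are illustrative.

```r
# Eigenvalues of Dim.1 and Dim.2 (from get_eigenvalue(res_pca))
eig <- c(Dim.1 = 4.75879, Dim.2 = 1.740146)

# Contributions (in %) of two variables to Dim.1 and Dim.2 (from var$contrib)
contrib <- rbind(
  Shot.put   = c(7.8615458, 21.542062),
  Pole.vault = c(0.4408398, 2.336074)
)
colnames(contrib) <- names(eig)

# Eigenvalue-weighted average of the per-component contributions
combined <- drop(contrib %*% eig) / sum(eig)

# Cutoff: the same weighted average applied to a uniform contribution
n_vars <- 12                 # assumption: 12 active variables in res_pca
uniform <- 100 / n_vars
cutoff <- sum(uniform * eig) / sum(eig)

print(round(combined, 2))
print(round(cutoff, 2))
```

Shot.put lands above the cutoff and Pole.vault well below it. Because the uniform contribution is the same on both components, the weighted cutoff reduces to 100/12 regardless of the eigenvalues.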

**The most important (or, contributing) variables can be highlighted on the correlation plot as follow:**

In [None]:
fviz_pca_var(res_pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

![Rplot001 (9).png](attachment:33c9d989-875e-4898-abc6-63c3aa992ad7.png)

It’s also possible to change the transparency of variables according to their contrib values using the option alpha.var = "contrib". For example:

In [None]:
# Change the transparency by contrib values
fviz_pca_var(res_pca, alpha.var = "contrib")

It’s also possible to change the color of variables by groups defined by a qualitative/categorical variable, also called a factor. As we don’t have any grouping variable in our data set for classifying variables, we’ll create one.

We start by classifying the variables into 3 groups using the kmeans clustering algorithm. Next, we use the clusters returned by the kmeans algorithm to color variables.

In [None]:
# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res_pca, col.var = grp, 
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")

![Rplot001 (10).png](attachment:f7311c9b-0ff6-4155-b967-9c4a0ce2cb7f.png)

# Dimension description

**Below we use dimdesc() to identify the variables most significantly associated with a given principal component.**

In [19]:
res.desc <- dimdesc(res_pca, axes = c(1,2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1

| | correlation | p.value |
|---|---|---|
| Points | 0.9816991 | 1.051268e-29 |
| Long.jump | 0.7559094 | 1.102973e-08 |
| Shot.put | 0.611649 | 2.150346e-05 |
| High.jump | 0.5878896 | 5.307214e-05 |
| Discus | 0.5293816 | 0.0003723514 |
| Javeline | 0.3189638 | 0.04208894 |
| X400m | -0.6952784 | 4.540305e-07 |
| X110m.hurdle | -0.7034326 | 2.906722e-07 |
| X100m | -0.7081629 | 2.228788e-07 |
| Rank | -0.7811767 | 1.678628e-09 |
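
A table of this shape can be approximated in base R by testing the correlation between each variable and the scores on a component. The sketch below is not the dimdesc() implementation; it assumes a Pearson correlation test via `cor.test()`, a 0.05 threshold mirroring `proba = 0.05`, and the stand-in `USArrests` data with `prcomp()`.

```r
# Stand-in data and a standardized PCA with base R
X <- USArrests
pc <- prcomp(X, center = TRUE, scale. = TRUE)
dim1 <- pc$x[, 1]   # scores of the observations on the first component

# Correlation and p-value of each variable with the first component
desc <- t(sapply(X, function(v) {
  ct <- cor.test(v, dim1)
  c(correlation = unname(ct$estimate), p.value = ct$p.value)
}))

# Keep significant variables and sort by decreasing correlation
desc <- desc[desc[, "p.value"] < 0.05, , drop = FALSE]
desc <- desc[order(desc[, "correlation"], decreasing = TRUE), , drop = FALSE]

print(desc)
```

Variables at the top and bottom of such a table are the ones that characterize the positive and negative ends of the dimension.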
