This R notebook can be run with mybinder: 

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/fchatelain/Examples_R_1A.git/master?filepath=pca-olympic_example.ipynb)


# Olympic decathlon data

###### This example is a short introduction to principal component analysis (PCA) using the R language. The data are the performance marks on the ten decathlon events for 33 athletes at the 1988 Olympic Games.

## Dataset
Load the `olympic` dataset, which ships with the `ade4` package.

In [None]:
library(ade4)
data(olympic)

Display some descriptive statistics for each of the ten events in this dataset.

In [None]:
summary(olympic$tab)


## PCA

Run *PCA* on the decathlon event scores $X \in \mathbb{R}^{n \times p}$, with $n=33$ samples (athletes) and $p=10$ variables/features (decathlon events).

In [None]:
pca.olympic = princomp(olympic$tab)
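
As a rough cross-check of what `princomp` computes, the sketch below recovers the component variances by hand from an eigendecomposition of the covariance matrix. It assumes that `princomp` uses the divisor $n$ (rather than $n-1$) for the covariance, which is what its documentation states; the names `X`, `S`, `eig` and `pca_cov` are illustrative and not part of the original notebook.

```r
library(ade4)
data(olympic)

X <- as.matrix(olympic$tab)
n <- nrow(X)

# Covariance matrix with divisor n (cov() uses n - 1, so rescale)
S <- cov(X) * (n - 1) / n

# Eigenvalues of S are the component variances, in decreasing order
eig <- eigen(S, symmetric = TRUE)

pca_cov <- princomp(olympic$tab)

# TRUE if the squared standard deviations match the eigenvalues
all.equal(unname(pca_cov$sdev^2), eig$values)
```

The eigenvectors in `eig$vectors` correspond to the loadings, up to a sign flip per column, since the sign of a principal component is arbitrary.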


How are the component variances (eigenvalues) $\sigma_i^2$, $1 \le i \le p$, distributed? Let's visualize them with a **screeplot**.

In [None]:
plot(pca.olympic) # screeplot (graphical display)

In [None]:
#cumPropVar= cumsum(pca.olympic$sdev^2)/sum(pca.olympic$sdev^2) 
summary(pca.olympic) # summary (numerical display)
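
The cumulative proportion of variance reported by `summary` can also be computed directly from the component standard deviations. The sketch below does so and picks the smallest number of components that reaches a 90% threshold; that threshold is an arbitrary illustrative choice, and the names `prop_var`, `cum_prop_var` and `k90` are not part of the original notebook.

```r
library(ade4)
data(olympic)

pca_cov <- princomp(olympic$tab)

# Variance carried by each component, as a fraction of the total
var_comp <- pca_cov$sdev^2
prop_var <- var_comp / sum(var_comp)

# Running total: fraction of variance explained by the first k components
cum_prop_var <- cumsum(prop_var)
round(cum_prop_var, 3)

# Smallest k whose cumulative proportion reaches 90% (illustrative threshold)
k90 <- which(cum_prop_var >= 0.90)[1]
k90
```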

The biplot gives a graphical summary of both the samples (athletes), in terms of their scores, and the variables/features (events), in terms of their loadings.

In [None]:
biplot(pca.olympic)

From this plot, we see that the first principal component is positively associated with longer times in the 1500 m. Slower runners will therefore have a higher value on this component.

In [None]:
cat('average 1500 event score (seconds) == ', mean( olympic$tab[, "1500"] ) )
data.frame( olympic$tab[, "1500"], pca.olympic$scores[, 1])

In [None]:
 plot(olympic$tab[, "1500"], pca.olympic$scores[, 1], pch = 23, bg = "red", cex = 2)

Also, the second principal component is associated with strength in the form of a long javelin throw.

In [None]:
 plot(olympic$tab[, "jave"], pca.olympic$scores[, 2], pch = 23, 
      bg = "red", cex = 2)
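
The two scatter plots above can be complemented by a single number each. The sketch below computes the correlation between each raw event and the corresponding component score; it is only a quick check, not part of the original analysis, and the names `r_1500` and `r_jave` are illustrative. Keep in mind that the sign of a principal component is arbitrary, so the magnitude of the correlation is what matters most.

```r
library(ade4)
data(olympic)

pca_cov <- princomp(olympic$tab)

# Correlation between the 1500 m times and the scores on component 1
r_1500 <- cor(olympic$tab[, "1500"], pca_cov$scores[, 1])

# Correlation between the javelin distances and the scores on component 2
r_jave <- cor(olympic$tab[, "jave"], pca_cov$scores[, 2])

c(PC1_vs_1500 = r_1500, PC2_vs_jave = r_jave)
```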

## Standardizing

In the previous example, we saw that the first two principal components were based somewhat on speed and strength. However, **we did not scale the variables**, so events measured on a scale with a large variance dominate: the 1500 has much more weight than the 400, for instance. We correct this by passing `cor = TRUE` (the default is `FALSE`) to `princomp`, which runs the PCA on the correlation matrix instead of the covariance matrix.

In [None]:
pca.olympic = princomp(olympic$tab, cor = TRUE)
biplot(pca.olympic)

This plot reinforces our earlier interpretation and has put the running events on an “even playing field” by standardizing.
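
To see why scaling matters here, the sketch below prints the raw variance of each event and then checks that `princomp(..., cor = TRUE)` yields the eigenvalues of the correlation matrix. The names `event_var`, `eig_cor` and `pca_cor` are illustrative and not part of the original notebook.

```r
library(ade4)
data(olympic)

# Raw variance of each event: columns with large variances dominate an unscaled PCA
event_var <- sapply(olympic$tab, var)
round(sort(event_var, decreasing = TRUE), 2)

# PCA on the correlation matrix, i.e. on standardized variables
pca_cor <- princomp(olympic$tab, cor = TRUE)

# Eigenvalues of the correlation matrix, computed by hand
eig_cor <- eigen(cor(olympic$tab), symmetric = TRUE)

# TRUE if the component variances match the eigenvalues
all.equal(unname(pca_cor$sdev^2), eig_cor$values)

# With standardized variables the total variance equals the number of events
sum(pca_cor$sdev^2)
```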

In [None]:
pca.olympic$loadings[, 1]


In [None]:
pca.olympic$loadings[, 2]