#1 Intro

This notebook illustrates how to perform a principal component analysis (PCA) in R. We will use the country risk dataset, which contains risk measures, GDP growth, and population for about 120 countries. (The year in which the data were collected is unknown.)

#2 Load the data

We first load the dataset and display its summary statistics.

We also note that the numerical variables in this dataset are measured in different units. We discuss why this leads us to scale each of those variables to have mean zero and standard deviation one before we perform PCA.

In [None]:
# load the readxl package (for importing Excel datasets)
library(readxl)

In [None]:
# download the dataset first because read_xlsx() from the readxl package cannot read an Excel file directly from a URL
data_url <- "https://github.com/tdmdal/datasets-teaching/raw/main/crisk/country_risk.xlsx"
# mode = "wb" writes the file in binary mode, which keeps the .xlsx file intact on Windows
download.file(url = data_url, destfile = "country_risk.xlsx", mode = "wb")

In [None]:
# import the data to a dataframe
country_risk <- read_xlsx(path = "country_risk.xlsx", sheet = "raw_kmeans", skip = 1)
head(country_risk)

In [None]:
# take a look at the structure (str) of the dataframe/tibble
str(country_risk)

You can find the variable descriptions in the Excel data file (in sheet `data_description`); they are also copied below. The `GDP Growth` (in percent) and `Population` variables are self-explanatory.

| Variable   | Description                                                                    |
|------------|--------------------------------------------------------------------------------|
| Corruption | Corruption index is on a scale from 0 (high corruption) to 100 (no corruption) |
| Peace      | Peace index is on a scale from 1 (very peaceful) to 5 (not at all peaceful)    |
| Legal      | Legal risk index is on a scale from 0 (high legal risk) to 10 (no legal risk)  |

In [None]:
# display summary statistics
summary(country_risk)

In [None]:
# display a covariance matrix for numerical columns
# mainly to demonstrate that the numerical variables have very different variances
var(country_risk[c("Corruption", "Peace", "Legal", "GDP Growth", "Population")])

Note that the variables are measured on different scales (or in different units). For example, the `Corruption` index is measured on a scale from 0 (high corruption) to 100 (no corruption), whereas the `Peace` index is measured on a scale from 1 (very peaceful) to 5 (not at all peaceful). This difference in scales contributes to the difference in variances between the variables. Because PCA looks for directions of maximum variance, variables with larger variances tend to get larger loadings on the first principal component, so the choice of scale can affect the PCA result. Since it is undesirable for the principal components to depend on an arbitrary choice of scale (or units), we usually rescale each variable to have mean zero and standard deviation one before performing PCA.
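
To make the effect of scale concrete, here is a minimal sketch on simulated data (the `toy` data frame and its column names are made up for illustration and are not part of the country risk dataset). It standardizes two variables with very different spreads and compares the first principal component loadings with and without scaling.

```r
# Simulated data: two unrelated variables measured on very different scales
set.seed(1)
toy <- data.frame(
  index_0_100 = rnorm(50, mean = 50, sd = 20),   # large variance
  index_1_5   = rnorm(50, mean = 3,  sd = 0.5)   # small variance
)

# scale() centers each column to mean zero and rescales it to standard deviation one
toy_scaled <- scale(toy)
col_means  <- colMeans(toy_scaled)
col_sds    <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)

# PCA without and with scaling
pca_raw <- prcomp(toy, scale. = FALSE)
pca_std <- prcomp(toy, scale. = TRUE)

# absolute loadings on the first principal component
print(abs(pca_raw$rotation[, "PC1"]))  # dominated by the large-variance variable
print(abs(pca_std$rotation[, "PC1"]))  # both variables contribute equally
```

Without scaling, the first component is driven almost entirely by the variable with the larger variance. After scaling, neither variable dominates merely because of its units.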

#3 PCA


##3.1 Run the PCA algorithm

We use the `prcomp()` function in the base R `stats` package to perform PCA. (Other functions in R can perform PCA too.) By default, `prcomp()` centers the variables to have mean zero, i.e., its `center` argument defaults to `TRUE`. To also scale the variables to have standard deviation one, we need to set the scaling argument to `TRUE`, because its default value is `FALSE`. The full name of this argument is `scale.` (with a trailing dot); the abbreviated form `scale = TRUE` also works because R matches argument names partially.

In [None]:
# PCA; Note we set scale = TRUE
pca_result <- prcomp(country_risk[c("Corruption", "Peace", "Legal", "GDP Growth", "Population")], scale = TRUE)
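
As a quick check of the argument names, the sketch below uses the built-in `USArrests` dataset as a stand-in (an assumption made so that the example runs without downloading the country risk file). It suggests that `scale = TRUE` and `scale. = TRUE` give the same fit, and that no scaling is applied when the argument is left at its default.

```r
# USArrests ships with base R; it is only a stand-in for the country risk data
pca_full     <- prcomp(USArrests, scale. = TRUE)  # full argument name
pca_partial  <- prcomp(USArrests, scale = TRUE)   # partially matched argument name
pca_unscaled <- prcomp(USArrests)                 # defaults: center = TRUE, scale. = FALSE

# the two scaled fits should have identical loadings
same_fit <- isTRUE(all.equal(pca_full$rotation, pca_partial$rotation))
print(same_fit)

# when no scaling is requested, the scale component is simply FALSE
print(pca_unscaled$scale)
```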

##3.2 Understand the output

In [None]:
# prcomp() returns a named list; display its data structure
str(pca_result)

We see that `prcomp()` returns a named list that stores all the results of the PCA. For example, the `center` and `scale` components store the means and standard deviations that were used to center and scale the variables before the PCA.

In [None]:
# variable means
print(pca_result$center)

In [None]:
# variable standard deviations
print(pca_result$scale)
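
The following sketch checks this interpretation on the built-in `USArrests` dataset (a stand-in chosen so the example is self-contained): the `center` and `scale` components should match the column means and standard deviations computed by hand.

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# column means and standard deviations computed directly from the data
manual_means <- colMeans(USArrests)
manual_sds   <- sapply(USArrests, sd)

# compare with what prcomp() stored
means_match <- isTRUE(all.equal(pca_demo$center, manual_means))
sds_match   <- isTRUE(all.equal(pca_demo$scale, manual_sds))
print(c(means_match = means_match, sds_match = sds_match))
```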

The `rotation` component stores the principal component loadings. In our case, we have 5 principal components (the `PCx` columns), and each has 5 loadings. (In general, `prcomp()` returns $\min(n, p)$ principal components, where $n$ is the number of observations and $p$ is the number of variables. Because the data are centered, at most $\min(n - 1, p)$ of them have nonzero variance.)

In [None]:
# PC loadings
print(pca_result$rotation)

In [None]:
# verify the squared loadings for each PC indeed sum up to 1
print(colSums(pca_result$rotation^2))
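
Beyond having unit length, the loading vectors are mutually orthogonal, and for scaled data they coincide (up to sign) with the eigenvectors of the correlation matrix. The sketch below illustrates both properties on the built-in `USArrests` dataset, used here only as a stand-in.

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)
rot <- pca_demo$rotation

# t(rot) %*% rot should be the identity matrix if the loadings are orthonormal
gram <- crossprod(rot)
is_orthonormal <- isTRUE(all.equal(gram, diag(ncol(rot)), check.attributes = FALSE))
print(is_orthonormal)

# eigenvectors of the correlation matrix, ordered by decreasing eigenvalue
eig <- eigen(cor(USArrests))

# signs of eigenvectors are arbitrary, so compare absolute values
same_up_to_sign <- isTRUE(all.equal(abs(unname(rot)), abs(eig$vectors)))
print(same_up_to_sign)
```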

The `x` component of the returned list contains the scores.

In [None]:
# PC scores
pca_result$x
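
The scores are the standardized data projected onto the loading vectors. A minimal sketch of that relationship, again using the built-in `USArrests` dataset as a stand-in:

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# standardize the data the same way prcomp() does, then project onto the loadings
standardized  <- scale(USArrests)
manual_scores <- standardized %*% pca_demo$rotation

# the manual projection should reproduce the x component
scores_match <- isTRUE(all.equal(manual_scores, pca_demo$x, check.attributes = FALSE))
print(scores_match)

# each score column has mean zero because the data were centered
print(round(colMeans(pca_demo$x), 10))
```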

##3.3 Visualize some results

We can plot the first two principal components of the country risk data as shown below. This is a *biplot* because it displays both the scores and the loadings of the first two principal components.

In [None]:
# plot the scores and loadings for the first two PCs.
# the scale = 0 argument ensures that the arrows in the plot are scaled to represent the loadings.
biplot(pca_result, scale = 0, xlabs=country_risk[["Abbrev"]])

The data points (labeled by country abbreviations) are the scores for the first and second principal components. The red arrows show each variable's loadings on the first two principal components (with axes on the top and right). For example, the loadings for `Corruption` on the first and second principal components are about 0.592 and 0.011 respectively (as given by `pca_result$rotation`). The `Corruption` arrow points in the direction of (0.592, 0.011), and the red label `Corruption` is centered at (0.592, 0.011).

**Note:** The `scale` argument of the `biplot()` function must be set to `0` to ensure that the arrows are scaled to represent the loadings.

Overall, we see that the first principal component loading vector places most of its weight on `Corruption`, `Legal` (both positive weights) and `Peace` (negative weight). The weights on `GDP Growth` and `Population` are small. Hence the first principal component roughly represents a weighted risk measure based on the `Corruption`, `Legal` and `Peace` variables.

Similarly, the second principal component places most of its weight on `GDP Growth` and `Population`, the economic and demographic measures of a country.
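
Reading the dominant variables off a loading matrix can also be done programmatically. The sketch below ranks variables by absolute loading for each component; `rank_loadings()` is a hypothetical helper written for this illustration, and the built-in `USArrests` dataset stands in for the country risk data.

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)
rot <- pca_demo$rotation

# hypothetical helper: variable names ordered by absolute loading on one component
rank_loadings <- function(rotation, pc) {
  rownames(rotation)[order(abs(rotation[, pc]), decreasing = TRUE)]
}

top_pc1 <- rank_loadings(rot, "PC1")
top_pc2 <- rank_loadings(rot, "PC2")
print(top_pc1)
print(top_pc2)
```

Absolute values are used because the sign of a loading vector is arbitrary; the signs still matter when interpreting how variables move together within a component.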

The `prcomp()` function also returns the standard deviation of each principal component, stored in the `sdev` component of the output list. We can use this information to plot the proportion of variance explained and the cumulative proportion of variance explained.

In [None]:
# display standard deviation of each principal component
print(pca_result$sdev)

In [None]:
# display variance of each principal component
pca_var <- pca_result$sdev^2
print(pca_var)

In [None]:
# calculate proportion of variance explained for each principal component
prop_var_explained <- pca_var / sum(pca_var)
print(prop_var_explained)
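
As a cross-check, `summary()` on a `prcomp` object reports the same proportions (rounded). The sketch below compares the manual calculation with the summary table on the built-in `USArrests` dataset, used as a stand-in, and also shows that for standardized data the component variances sum to the number of variables.

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# manual calculation, as in the cells above
pc_var <- pca_demo$sdev^2
pve    <- pc_var / sum(pc_var)

# each standardized variable has variance one, so the total equals the number of variables
total_var <- sum(pc_var)
print(total_var)

# the importance table from summary() holds rounded proportions
importance       <- summary(pca_demo)$importance
pve_from_summary <- unname(importance["Proportion of Variance", ])
print(rbind(manual = pve, summary = pve_from_summary))
```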

In [None]:
par(mfrow = c(1, 2))

# plot proportion of variance explained
plot(prop_var_explained,
     xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")

# plot cumulative proportion of variance explained
plot(cumsum(prop_var_explained),
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1),
     type = "b")

We see that the first three principal components together explain about 90% of the total variance in the data. Depending on what you plan to do with the data, you may consider representing the original dataset (5D) using only its first three principal components (3D).
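
One way to turn this into a rule is to keep the smallest number of components whose cumulative proportion of variance reaches a chosen threshold. The sketch below does this on the built-in `USArrests` dataset (a stand-in); the 90% threshold is an illustrative choice, not a recommendation.

```r
# USArrests is a stand-in dataset, not the country risk data
pca_demo <- prcomp(USArrests, scale. = TRUE)
pve <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# smallest number of components whose cumulative variance reaches the threshold
threshold <- 0.90  # illustrative choice
k <- which(cumsum(pve) >= threshold)[1]
print(k)

# reduced representation: keep only the scores of the first k components
reduced <- pca_demo$x[, seq_len(k), drop = FALSE]
print(dim(reduced))
```

The `reduced` matrix has one row per observation and `k` columns, and can be passed to downstream methods in place of the original variables.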