# Principal Component Analysis
# Written by Josue Barnes



Let’s say you have a data set with 15 features (imagine a column for each feature) and you are having a hard time understanding how the features correlate. Or maybe your algorithms take too long to run because the data set has too many features. Or you want to visualize your data set but recognize that we can view at most 3 dimensions at a time. These are all examples of the curse of dimensionality. Having a large number of features that describe a sample is great, until we have to make sense of it all.

So what can we do to manage the high dimensionality of a data set? One of the easiest approaches is to reduce the number of dimensions (aka dimensionality reduction). Two common methods for dimensionality reduction are Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).

Today we are going to use PCA to visualize the iris data set, which contains iris flower data. Can someone describe the data set to me? Its size, what features we have.

I want to note that I will not be diving into the math behind PCA, only its usefulness for dimensionality reduction and visualization. If you are interested in the math, check out the following link: https://tgmstat.wordpress.com/2013/11/21/introduction-to-principal-component-analysis-pca/


In [1]:
#load the data
data(iris) 

#view features
head(iris)


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
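
To answer the question about size and features, a quick check along these lines may help. It is only a sketch and assumes nothing beyond base R and the built-in iris data; the variable names `iris.dim` and `species.counts` are introduced here for illustration.

```r
# Load the built-in iris data (ships with base R's datasets package)
data(iris)

# Number of rows (samples) and columns (features)
iris.dim = dim(iris)
print(iris.dim)

# Column types: four numeric measurements and one factor
str(iris)

# Number of samples per species
species.counts = table(iris$Species)
print(species.counts)
```

The built-in data has 150 samples and 5 columns, with 50 samples for each of the three species.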


The data set consists of 4 continuous features and a class feature. Let’s make one variable containing our 4 continuous features (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) and one variable with the species of each sample.

In [6]:
iris.data = iris[,1:4]
iris.species = iris[,5]

Now, let's apply PCA using the prcomp() function in R and let's save this to a variable so that we can take a look at summary information. 

In [7]:
iris.pca = prcomp(iris.data)
summary(iris.pca)

Importance of components:
                          PC1     PC2    PC3     PC4
Standard deviation     2.0563 0.49262 0.2797 0.15439
Proportion of Variance 0.9246 0.05307 0.0171 0.00521
Cumulative Proportion  0.9246 0.97769 0.9948 1.00000

We've performed our dimensionality reduction, but what we want to pay close attention to is which PCs give us the most information by having the largest variance. First let's look at the proportion of variance. PC1 has about 92% of the variance, PC2 about 5%, and PC3 about 2%. When looking at the cumulative proportion, we will want to use enough PCs so that this proportion is at least about 90%. This isn't a fixed number; it is very much up to the user to decide which principal components are most useful. In this case PC1 alone already passes 90%, and PC1 and PC2 together cover about 98%, so these two give us what we need for a 2D plot.
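
The rows of the summary table can be recomputed from the standard deviations that prcomp() stores in `$sdev`. The sketch below assumes the same unscaled call used above; the names `pc.var`, `prop.var`, `cum.var`, `cutoff`, and `n.keep` are made up for this example, and the 0.90 cutoff is just the rule of thumb from the text.

```r
data(iris)
iris.pca = prcomp(iris[, 1:4])

# Variance of each component is the squared standard deviation
pc.var = iris.pca$sdev^2

# Proportion of variance explained by each component
prop.var = pc.var / sum(pc.var)

# Running total, matching the "Cumulative Proportion" row of summary()
cum.var = cumsum(prop.var)
print(round(rbind(prop.var, cum.var), 4))

# Smallest number of components whose cumulative proportion reaches the cutoff
# (0.90 is a rule of thumb, not a fixed requirement)
cutoff = 0.90
n.keep = which(cum.var >= cutoff)[1]
print(n.keep)
```

With this cutoff a single component would be enough for iris; a second one is kept mainly because a scatter plot needs two axes.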

Now let's plot PC1 against PC2.

In [8]:
library(ggplot2)  # run install.packages("ggplot2") first if the package is missing
# prcomp() stores the projected samples (scores) in $x; ggplot() needs a data frame
iris.scores = data.frame(iris.pca$x, Species = iris.species)
ggplot(data = iris.scores, aes(PC1, PC2, col = Species)) + geom_point(size = 3.5) + theme_linedraw() +
geom_smooth(method = "lm", se = FALSE)


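Hopefully what you see are 3 clusters, each representing a species of iris. The data has been transformed so that each principal component is a linear combination of the original features, with PC1 chosen to capture the largest possible variance and each later PC capturing as much of the remaining variance as it can.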
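
To make the "linear combination" idea concrete, the sketch below rebuilds the PC scores by hand from the weights stored in `$rotation`. It assumes the default prcomp() settings (centered, not scaled); `centered`, `manual.scores`, and `scores.match` are names introduced only for this example.

```r
data(iris)
iris.data = iris[, 1:4]
iris.pca = prcomp(iris.data)

# Loadings: each column holds the weights of the four features in one PC
print(round(iris.pca$rotation, 3))

# prcomp() centers each feature by default (center = TRUE, scale. = FALSE)
centered = scale(iris.data, center = TRUE, scale = FALSE)

# Scores are the centered data multiplied by the loadings
manual.scores = centered %*% iris.pca$rotation

# Compare against the scores that prcomp() stored in $x
scores.match = isTRUE(all.equal(manual.scores, iris.pca$x, check.attributes = FALSE))
print(scores.match)
```

Each point in the PC1 vs PC2 plot is therefore just a weighted sum of that flower's four centered measurements.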