<img src="https://ibm.box.com/shared/static/hhxv35atrvcom7qc1ngn632mkkg5d6l3.png" width="200">

<h2 align="center">Toronto - Big Data University Meetup</h2>
<h1 align="center">Data Mining Algorithms</h1>
<h3 align="center">October 26, 2015</h3>
<h4 align="center"><a href="https://linkedin.com/in/polonglin">Polong Lin</a></h4>
<h4 align="center"><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a></h4>
<h4 align="center">R code by: <a href="https://ca.linkedin.com/pub/konstantin-tskhay/b2/556/562">Konstantin Tskhay</a></h4>
<hr>

## Welcome to Data Scientist Workbench

Data Scientist Workbench is an environment that hosts multiple data science tools:
- Python notebooks (PySpark pre-installed)
- R notebooks (SparkR pre-installed)
- Scala notebooks (Spark pre-installed)
- <a href = "https://datascientistworkbench.com/rstudio">RStudio</a>
- <a href = "https://datascientistworkbench.com/openrefine">OpenRefine</a>


### Initial setup

In [None]:
%%bash
pip install rpy2

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
detach(package:SparkR)
R.Version()$version.string

Now we can run R within this Python notebook by starting a cell with `%%R`.

<br>
<hr>

<h1 align="center">Principal Component Analysis</h1>

### Install packages for R

In [None]:
%%R
## May take a while to install; uncomment on the first run
#install.packages("psych")
#install.packages("ggplot2")
library(psych)
library(ggplot2)

### Getting the data

In [None]:
!wget -O new_prof_data.csv https://ibm.box.com/shared/static/iiskx4kggmwkt1a7po6vlmvefvoxm08z.csv

### Importing the data

The data are in .csv format: 'new_prof_data.csv'

Here is a brief description of the variables:


1. **ID** = Observation ID
2. **Prof.Name** = The name of the professor. Here, Name1 to Name213 are used
3. **Present** = “Presents the material in an organized, well-planned manner.”
4. **Explain** = “Explains concepts clearly and with appropriate use of examples.”
5. **Communi** = “Communicates enthusiasm and interest in the course material.”
6. **Teach** = “All things considered, performs effectively as a university teacher.”
7. **Workload** = “Compared to other courses at the same level, the workload is…”
8. **Difficulty** = “Compared to other courses at the same level, the level of difficulty of the material is…”
9. **learn.Exp** = “The value of the overall learning experience is…”
10. **Retake** = “Considering your experience with this course, and disregarding your need for it to meet program or degree requirements, would you still have taken this course?”
11. **Inf.** = The aggregate influence score (Interpersonal Charisma Scale)
12. **Kind** = The aggregate kindness score (Interpersonal Charisma Scale)


**_Notes._**

**Q3-Q6 scale**: 1 = extremely poor; 2 = very poor; 3 = poor; 4 = adequate; 5 = good; 6 = very good; 7 = outstanding

**Q7-Q9 scale**: 1 = very low; 2 = low; 3 = below average; 4 = average; 5 = above average; 6 = high; 7 = very high

**Q10 scale**: proportion of people out of 100 who would still take the course considering the experience

**Q11-Q12 scale**: “I am someone who is…”; 1 = strongly disagree; 2 = moderately disagree; 3 = neither agree nor disagree; 4 = moderately agree; 5 = strongly agree



In [None]:
%%R

## Read the data into an object named data
## (relative path: the wget cell above saves the file to the working directory)
data <- read.csv('new_prof_data.csv')

## Examine data:
names(data)
str(data)
summary(data)
head(data)

## Extract necessary data for PCA

This step is simple: select the relevant columns.
We want columns 3 through 8 (Present through Difficulty).

In [None]:
%%R
names(data) ## look up index
comp.data <- data[,3:8] ## extract data
names(comp.data) ## check

## Set up all of the preconditions

**Question**: Can we reduce the number of variables?

**Answer**: Yes. Let's do it.


To do so, we need to see interrelationships between the variables.
From before, we know that there may be 2 types of variables emerging.

1. The variables that specify how good the professors are at communication
2. The variables that track the course difficulty

Let us examine whether this is the case:

In [None]:
%%R
round(cor(comp.data), digits = 3) ## produces correlation matrix

You can see immediately that all communication variables are highly correlated. The difficulty variable correlates quite highly with the workload variable. However, there appears to be little overlap between communication and workload/difficulty variables.

**This suggests that there are probably 2 components/factors in our data.**
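
To make the correlation matrix less of a black box, here is a minimal sketch of the same computation in a plain Python cell. The `pearson` helper and the `toy` ratings are hypothetical, made up for illustration; they are not taken from `new_prof_data.csv` and are not the code behind R's `cor`.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings for five professors (illustrative only)
toy = {
    "Present":  [5.0, 6.0, 4.0, 7.0, 3.0],
    "Explain":  [5.5, 6.0, 4.5, 6.5, 3.5],
    "Workload": [4.0, 5.0, 5.0, 3.0, 3.0],
}

# Build the full matrix by correlating every column with every other column
names = list(toy)
corr = [[round(pearson(toy[a], toy[b]), 3) for b in names] for a in names]
for name, row in zip(names, corr):
    print(name.ljust(9), row)
```

In this toy data the two communication-style columns move together while the workload column is nearly unrelated to both, which is the kind of block pattern that hints at two separate components.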


### Decision Rules:
1. Probably 2 components (Communication, Workload)
2. The components are probably orthogonal
3. Check it empirically

## Run PCA/Visualizations & Interpret the output

In [None]:
%%R
pcaSolution <- prcomp(comp.data, center = TRUE, scale. = TRUE)

## prcomp returns an object holding the standard deviations of the
## principal components (sdev) and the rotation matrix, i.e. the
## loadings of each variable on the principal components

print(pcaSolution) ## prints PCA solution

pcaSolution$sdev^2 ## eigenvalues: the variance explained by each component

### Let's create the Scree plot: Variance explained versus components

In [None]:
%%R
plot(pcaSolution, type = "l", pch = 16, lwd = 2) ## generates a scree plot

This figure helps us decide how many components to extract.
The first two PCs explain most of the variability in the data, so we should probably extract 2 components.

### How many components to keep?

#### Kaiser-Guttman Rule
The number of factors extracted equals the number of factors with eigenvalues greater than 1.
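
As a rough sketch of this rule in a plain Python cell: with standardized variables each original variable contributes a variance of 1, so a component with an eigenvalue above 1 explains more than any single variable does. The eigenvalues below are hypothetical (chosen to sum to 6, as they would for six standardized variables); in the notebook the real ones come from `pcaSolution$sdev^2`. The function name `kaiser_guttman` is also just an illustrative choice.

```python
# Hypothetical eigenvalues for six standardized variables (illustrative only)
eigenvalues = [3.1, 1.7, 0.5, 0.4, 0.2, 0.1]

def kaiser_guttman(eigs, cutoff=1.0):
    """Count the components whose eigenvalue exceeds the cutoff."""
    return sum(1 for e in eigs if e > cutoff)

n_keep = kaiser_guttman(eigenvalues)
print("Components to keep:", n_keep)
```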

#### Percentage of Common Variance
The retained factors should together explain at least 50% of the variance; most people aim for 75%, and 90% is ideal.
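
A minimal sketch of how this threshold could be applied, assuming the eigenvalues are already sorted from largest to smallest. The eigenvalues are hypothetical stand-ins for `pcaSolution$sdev^2`, and `components_for_variance` is an illustrative helper name, not part of any library.

```python
def components_for_variance(eigs, target):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the target proportion."""
    total = sum(eigs)
    running = 0.0
    for k, e in enumerate(eigs, start=1):
        running += e
        if running / total >= target:
            return k
    return len(eigs)

# Hypothetical eigenvalues for six standardized variables (illustrative only)
eigenvalues = [3.1, 1.7, 0.5, 0.4, 0.2, 0.1]

for target in (0.50, 0.75, 0.90):
    k = components_for_variance(eigenvalues, target)
    print(f"{target:.0%} of variance -> keep {k} component(s)")
```

Note how quickly the answer grows with the target: stricter thresholds pull in components that each add very little on their own.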

#### Scree Test
The number of factors retained should be the last number before the rate of change in eigenvalues levels off.
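
The scree test is normally judged by eye, but a crude automated version can be sketched as follows. This is only one possible heuristic: the `flat_fraction` cutoff of 0.25 is an arbitrary assumption, the eigenvalues are hypothetical, and `scree_elbow` is an illustrative name.

```python
def scree_elbow(eigs, flat_fraction=0.25):
    """Return the number of components before the eigenvalue curve flattens.

    A drop between neighbouring eigenvalues counts as 'flat' when it is
    smaller than flat_fraction times the largest drop (an arbitrary choice).
    """
    drops = [eigs[i] - eigs[i + 1] for i in range(len(eigs) - 1)]
    threshold = flat_fraction * max(drops)
    for i, drop in enumerate(drops):
        if drop < threshold:
            # drops[i] is the step from component i+1 to i+2, so the flat
            # stretch starts at component i+1 and i components precede it
            return max(i, 1)
    return len(eigs)

# Hypothetical eigenvalues for six standardized variables (illustrative only)
eigenvalues = [3.1, 1.7, 0.5, 0.4, 0.2, 0.1]
n_keep = scree_elbow(eigenvalues)
print("Components before the curve levels off:", n_keep)
```

Treat the result as a starting point and confirm it against the plotted scree curve.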


<br>

<hr>

## RESOURCES:


### Useful Links:

- **Data Science** http://bigdatauniversity.com
- **Clustering** http://bigdatauniversity.com/bdu-wp/bdu-course/machine-learning-cluster-analysis/
- **R-Code** http://www.statmethods.net/advstats/factor.html
- **Visualize** http://www.r-bloggers.com/computing-and-visualizing-pca-in-r/

### Books:
- **Principal Component Analysis** http://www.amazon.ca/Principal-Components-Analysis-George-Dunteman/dp/0803931042/ref=sr_1_2?ie=UTF8&qid=1444011812&sr=8-2&keywords=principal+component+analysis

### Uses in Measurement:
- http://scholarship.sha.cornell.edu/cgi/viewcontent.cgi?article=1618&context=articles
- http://personal.stevens.edu/~ysakamot/719/week4/scaledevelopment.pdf
- http://scholarship.sha.cornell.edu/cgi/viewcontent.cgi?article=1515&context=articles