# Activity 1

## Part 1: Explore the dataset from RateMyProfessor.com

In [None]:
# import library
library(ggplot2) # for visualization

In [None]:
# import data
dat = read.csv('prof_train.csv')

In [None]:
head(dat) # display the first few rows of data

In [None]:
str(dat)  # structure of data, you can see the type of variables in the dataset

### The attribute "area" has only one value, "Midwest". Because a constant column carries no information, we can remove it to reduce the number of dimensions.
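A constant column such as "area" can also be found programmatically. The sketch below is a minimal illustration on a small made-up data frame (`toy` and its values are assumptions, not rows from `prof_train.csv`); the same `sapply` call should work on `dat`.

```r
# Toy data standing in for `dat`; the values are made up for illustration
toy <- data.frame(
  area        = factor(rep("Midwest", 5)),
  quality     = c(4.1, 3.2, 2.8, 4.7, 3.9),
  helpfulness = c(4.3, 3.0, 2.5, 4.8, 4.0)
)

# Count the distinct values in each column
n_unique <- sapply(toy, function(x) length(unique(x)))

# Columns with a single distinct value are constant and uninformative
constant_cols <- names(n_unique)[n_unique == 1]
constant_cols
```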

In [None]:
# For simplicity, we will not use the Factor attributes in this dataset
dat1 = dat[, c(2,3,4,7,8,9,10,11)]  # Define a new dataset with only the numerical variables (no factors)
head(dat1)

In [None]:
summary(dat1) # the range of each variable

In [None]:
pairs(dat1) # explore the scatterplot matrix

### From the plot above, we can see some fairly strong linear relationships between variables.

In [None]:
cor(dat1$quality, dat1$helpfulness) # calculate the correlation between "quality" and "helpfulness"

In [None]:
cor(dat1)  # calculate the correlation of each pair of variables. 
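Reading a full correlation matrix by eye gets tedious, so it can help to pull out only the row for the response and sort it. The sketch below does this on simulated data (the data frame `toy` and its columns are assumptions for illustration); on the real data, `cor(dat1)["quality", ]` would be the starting point.

```r
# Simulated data: "quality" is built to track "helpfulness" closely,
# while "easiness" is generated independently
set.seed(1)
n <- 50
helpfulness <- runif(n, 1, 5)
toy <- data.frame(
  quality     = helpfulness + rnorm(n, sd = 0.2),
  helpfulness = helpfulness,
  easiness    = runif(n, 1, 5)
)

# Correlation of every variable with "quality", dropping quality itself
cors <- cor(toy)["quality", ]
cors <- cors[names(cors) != "quality"]

# Rank the remaining variables by the absolute strength of the correlation
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```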

## Part 2: Build a prediction model

### Next we will build a prediction model for the "quality" rating of a professor and then test the performance of our prediction model. 

### Let's set "quality" aside, since it is the response variable in our prediction model. The remaining variables are the candidate predictors, and we would like to reduce their number.

### "helpfulness" seems to be a good indicator for "quality", considering the high correlation between "quality" and "helpfulness". 


In [None]:
mod0 = lm(quality ~ helpfulness, data = dat1)  # regression model

In [None]:
ggplot(data = dat1, aes(x = helpfulness, y = quality)) + 
  geom_point(color='red') +
  geom_smooth(method = "lm", se = FALSE)

### In the plot, the red dots represent the pairs of (helpfulness, quality) from dat1. The blue line represents our regression model. 
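Beyond the plot, the fitted line can be inspected numerically. The following sketch fits the same kind of one-predictor model on simulated data (the data frame `toy` and the coefficients 0.5 and 0.9 used to generate it are assumptions) and extracts the intercept, slope, and R-squared. Applied to the notebook's model, the calls would be `coef(mod0)` and `summary(mod0)$r.squared`.

```r
# Simulated data with a known linear relationship plus noise
set.seed(1)
toy <- data.frame(helpfulness = runif(50, 1, 5))
toy$quality <- 0.5 + 0.9 * toy$helpfulness + rnorm(50, sd = 0.2)

# Simple linear regression of quality on helpfulness
fit <- lm(quality ~ helpfulness, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# Proportion of variance in quality explained by helpfulness
r2 <- summary(fit)$r.squared
r2
```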

### We've reduced the dimensions from 7 (numYears, numRates, numCourses, helpfulness, clarity, easiness, raterInterest) to 1 (helpfulness)! Next we will see some more systematic methods for dimension reduction. 

## Part 3: Make predictions on 40 professors

In [None]:
test = read.csv('prof_test.csv')
head(test) # the test set has no "quality" column, so we need to predict it

In [None]:
predict(mod0, test)

### Voila, these are the predicted "quality" ratings for the professors in the test set! Let's compare them to the truth (the real "quality" ratings for these professors).

In [None]:
truth = read.csv('prof_test_truth.csv')

In [None]:
head(truth)

### We measure the performance of our model with the mean absolute error (MAE): the average of |predicted quality - true quality| over the test set.

In [None]:
mean(abs(truth$quality - predict(mod0, test)))

### Keep this value in mind as we evaluate the performance of other models later. 
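Since the same error measure will be reused for later models, it may be convenient to wrap it in a small helper. The sketch below defines a hypothetical function `mae()` (not part of the original notebook) and compares a model's error with that of a constant predictor. All numbers, including `train_mean`, are made up for illustration.

```r
# Hypothetical helper: mean absolute error between two numeric vectors
mae <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(abs(actual - predicted))
}

# Made-up true ratings and model predictions for four professors
actual    <- c(4.0, 3.0, 2.5, 4.5)
predicted <- c(3.8, 3.4, 2.5, 4.0)

# Error of the model's predictions
model_mae <- mae(actual, predicted)

# Baseline: always predict one constant, e.g. the mean training rating
train_mean   <- 3.5  # assumed value, for illustration only
baseline_mae <- mae(actual, rep(train_mean, length(actual)))

c(model = model_mae, baseline = baseline_mae)
```

With the notebook's objects, the equivalent call would be `mae(truth$quality, predict(mod0, test))`, which assumes the rows of `test` and `truth` are in the same order.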