### In this exercise, we will predict the number of applications received using the other variables in the College data set in the ISLR2 package. ** be sure to look closely at this data, you may want to consider the multi-scale nature of the problem, and perhaps use a transformation on some of the variables.**

In [20]:
rm()

In [21]:
library(ISLR2)
data(College)
#head(College)
dim(College)

In [22]:
# install.packages("caret")
library(caret)

#### 1. a) Split the data set into a training set and a test set. 

In [23]:
#1 (a) Split the data set into a training set and a test set. 

set.seed(1)
college_data <- College

train_indis <- sample(1:nrow(college_data), size = round(2/3*nrow(college_data)), replace = FALSE )

training_data <- college_data[train_indis,]
test_data <- college_data[-train_indis,]

y_true_train <- training_data$Apps
y_true_test <- test_data$Apps

#### Fit a linear model using least squares on the training set, and report the test error obtained.

In [24]:
#Linear regression model 
linear_model <- lm(Apps ~ ., data = training_data)
summary(linear_model)

#Fitting trainning model on test set
pred_linear <- predict(linear_model, new_data = test_data)

#Test error
test_err <- sum((y_true_test - pred_linear)^2) #sum of residual squares error
mean_squared_err <- mean((y_true_test - pred_linear)^2) #mean squared error

print(paste("Residual sum of squares error :", test_err, sep = " "))
print(paste("Mean Squared error :", mean_squared_err, sep = " "))


Call:
lm(formula = Apps ~ ., data = training_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5858.9  -454.1    -3.1   346.8  7322.1 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -681.36179  523.92530  -1.300  0.19403    
PrivateYes  -471.63266  168.93660  -2.792  0.00544 ** 
Accept         1.71602    0.04853  35.363  < 2e-16 ***
Enroll        -0.99656    0.25372  -3.928 9.78e-05 ***
Top10perc     56.15705    6.65558   8.438 3.51e-16 ***
Top25perc    -16.87251    5.23835  -3.221  0.00136 ** 
F.Undergrad    0.02392    0.04498   0.532  0.59518    
P.Undergrad    0.07129    0.03705   1.924  0.05489 .  
Outstate      -0.09104    0.02233  -4.077 5.31e-05 ***
Room.Board     0.18008    0.05844   3.081  0.00218 ** 
Books          0.38276    0.32701   1.170  0.24237    
Personal      -0.01392    0.07506  -0.186  0.85290    
PhD          -12.89843    5.70565  -2.261  0.02421 *  
Terminal       1.65090    6.22623   0.265  0.79100    
S.F.Ratio     

[1] "Residual sum of squares error : 14256206623.2306"
[1] "Mean Squared error : 27521634.4077811"


In [25]:
#install.packages("glmnet")
library(glmnet)

#### 1. b) Fit a ridge regression model on the training set, with λ chosen by crossvalidation. Report the test error obtained.

In [26]:
#Converting training and test data to matrix as cv.glment takes matrix as input
x_train <- model.matrix(Apps~.,training_data)[,-1]
x_test <- model.matrix(Apps~.,test_data)[,-1]


#Choosing the tuning parameter using cross validation
set.seed(2)
cv_out_ridge <- cv.glmnet(x_train, y_true_train, alpha = 0)
#plot(cv_out_ridge)
bestlam_ridge <- cv_out_ridge$lambda.min


#Ridge regression model 
ridge_model <- glmnet(x_train, y_true_train, alpha = 0, lambda = bestlam_ridge)
#summary(ridge_model)


#Fitting trainning model on test set
pred_ridge <- predict(ridge_model, s = bestlam_ridge, newx = x_test)


#Test error
test_err_ridge <- sum((pred_ridge - y_true_test)^2)  #sum of residual squares error
mean_squared_err_ridge <- mean((pred_ridge - y_true_test)^2) #mean squared error

print(paste("Residual sum of squares error :", test_err_ridge, sep = " "))
print(paste("Mean Squared error :", mean_squared_err_ridge, sep = " "))



[1] "Residual sum of squares error : 288284960.387111"
[1] "Mean Squared error : 1113069.34512398"


#### 1. d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

In [27]:
#Choosing the tuning parameter using cross validation
set.seed(3)
cv_out_lasso <- cv.glmnet(x_train, y_true_train, alpha = 1)
#plot(cv_out_lasso)
bestlam_lasso <- cv_out_lasso$lambda.min


#Lasso model 
lasso_model <- glmnet(x_train, y_true_train, alpha = 1, lambda = bestlam_lasso)
#summary(lasso_model)


#Fitting trainning model on test set
pred_lasso <- predict(lasso_model, s = bestlam_lasso, newx = x_test)


#Test error
test_err_lasso <- sum((pred_lasso - y_true_test)^2)  #sum of residual squares error
mean_squared_lasso <- mean((pred_lasso - y_true_test)^2) #mean squared error

print(paste("Residual sum of squares error :", test_err_lasso, sep = " "))
print(paste("Mean Squared error :", mean_squared_lasso, sep = " "))

#Lasso coefficients
lasso_coef <- predict(lasso_model, type = "coefficients", s = bestlam_lasso)[1:length(lasso_model$beta),]
#Non-zero coefficients
nonzero_coef <- lasso_coef[lasso_coef!=0]
print("Non-zero coefficient estimates are :")
print(nonzero_coef)

[1] "Residual sum of squares error : 304910441.822123"
[1] "Mean Squared error : 1177260.39313561"
[1] "Non-zero coefficient estimates are :"
  (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
-747.93075081 -409.30830511    1.63784037   -0.62491795   46.99407816 
    Top25perc   P.Undergrad      Outstate    Room.Board         Books 
 -10.41045253    0.05029946   -0.07347353    0.16322017    0.29906589 
          PhD     S.F.Ratio        Expend 
  -9.87016666   14.78812206    0.05606723 


#### 1. (g) Comment more generally on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these three approaches?

In [99]:
totalSumOFSquares = sum((mean(test_data$Apps) - test_data$Apps)^2)
totalSumOfResidualSquares = sum((pred_ridge - test_data$Apps)^2)
accuracy <- (1 - (totalSumOfResidualSquares)/(totalSumOFSquares))
print(accuracy)

[1] 0.9219198


**Yes, we can predict the number of college applications recieed with an accuracy of 92.19198%**