# Linear SVC Assignment

In [1]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

### Import the admissions data set (admissions.csv).

In [2]:
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/admissions.csv')
data.head()

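| Unnamed: 0 | GRE | TOEFL | SchoolRank | SOP | LOR | GPA | Research | Admitted |
|---|---|---|---|---|---|---|---|---|
| 0 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 1 |
| 1 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 1 |
| 2 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 1 |
| 3 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 1 |
| 4 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0 |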


### Split the data into training and test sets, with the test set comprising 30% of the data.  Use `'Admitted'` as the target.

In [3]:
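# Target: the admission decision. Features: every remaining column.
# Note that X keeps the 'Unnamed: 0' column shown in the head() output above.
y = data.Admitted
X = data.drop(columns=['Admitted'])

# Hold out 30% of the rows for testing (no random_state, so the split varies between runs).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)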
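
Because the split above is unseeded, the scores reported in this notebook will shift slightly from run to run. The sketch below shows one way to make a 70/30 split repeatable and class-balanced, using the `random_state` and `stratify` parameters of `train_test_split`. The `demo` frame is a hypothetical, randomly generated stand-in for the admissions data, not the real file, and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the admissions frame: 400 rows, binary target.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "GRE": rng.integers(290, 341, size=400),
    "GPA": rng.uniform(6.8, 9.9, size=400).round(2),
    "Admitted": rng.integers(0, 2, size=400),
})

y_demo = demo["Admitted"]
X_demo = demo.drop(columns=["Admitted"])

# random_state fixes the shuffle; stratify keeps the share of each class
# (approximately) equal in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Admitted rate, train vs. test:", y_tr.mean(), y_te.mean())
```

Stratifying matters most when some classes are rare, since an unlucky random split can otherwise leave very few examples of a class in one of the two sets.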

### Generate an SVC model with a linear kernel. Set the regularization parameter (C) = 10. Check the score for both train and test sets. 

In [4]:
# SVC is already imported in the first cell.
svm = SVC(C=10, kernel='linear')
svm.fit(X_train, y_train)

print("Training Set Score: ", svm.score(X_train, y_train))
print("Testing Set Score: ", svm.score(X_test, y_test))

Training Set Score:  0.875
Testing Set Score:  0.8666666666666667


### Choose some other values for C and show the difference between the scores for the train and test sets.

In [5]:
# A larger C means weaker regularization: misclassified training points are penalized more heavily.
svm = SVC(C=1000, kernel='linear')
svm.fit(X_train, y_train)

print("Training Set Score: ", svm.score(X_train, y_train))
print("Testing Set Score: ", svm.score(X_test, y_test))

Training Set Score:  0.8714285714285714
Testing Set Score:  0.8833333333333333
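
Comparing only C=10 and C=1000 gives a narrow picture of how regularization affects the fit. A minimal sketch of sweeping several values in a loop follows. It uses a synthetic dataset from `make_classification` as a stand-in for the admissions data (an assumption made so the example is self-contained), and the grid of C values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 400 samples, 7 features, binary target.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0
)

# Fit one linear-kernel SVC per C value and record (train, test) accuracy.
results = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = SVC(C=C, kernel="linear").fit(X_tr, y_tr)
    results[C] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    train_acc, test_acc = results[C]
    print(f"C={C:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

A widening gap between the training and test scores as C grows would suggest overfitting. Scores that barely move, as in the two runs above, suggest the model is not very sensitive to C over that range.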


### What if we switched up the target variable? Let's assume that we know whether a student was admitted, and try to predict what their SchoolRank was.

Create an SVC model with a linear kernel with the SchoolRank field as the target variable. Report both the train and the test scores.

In [6]:
y = data.SchoolRank
X = data.drop(columns=['SchoolRank'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

svm = SVC(C=10, kernel='linear')
svm.fit(X_train, y_train)

print("Training Set Score: ", svm.score(X_train, y_train))
print("Testing Set Score: ", svm.score(X_test, y_test))

Training Set Score:  0.6535714285714286
Testing Set Score:  0.575


### Show confusion matrices for the training and test sets, and a classification report for the test set. What trends do you notice?

In [7]:
# Rows are the true SchoolRank values (1-5); columns are the predicted values.
y_pred_train = svm.predict(X_train)
confusion = confusion_matrix(y_train, y_pred_train)
print(confusion)

[[ 2 12  1  0  0]
 [ 0 58 19  0  1]
 [ 0 11 73  3  1]
 [ 0  3 14 26 12]
 [ 0  0  7 13 24]]


In [8]:
# Same layout as above, now for the held-out test set.
y_pred_test = svm.predict(X_test)
confusion = confusion_matrix(y_test, y_pred_test)
print(confusion)

[[ 0 11  0  0  0]
 [ 0 18 10  0  1]
 [ 0 12 30  3  0]
 [ 0  1  5  9  4]
 [ 0  0  1  3 12]]


In [9]:
report = classification_report(y_test, y_pred_test)
print(report)

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        11
           2       0.43      0.62      0.51        29
           3       0.65      0.67      0.66        45
           4       0.60      0.47      0.53        19
           5       0.71      0.75      0.73        16

    accuracy                           0.57       120
   macro avg       0.48      0.50      0.48       120
weighted avg       0.54      0.57      0.55       120
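
The figures in this report can be derived directly from the test confusion matrix printed earlier, which helps when reading the two outputs side by side. The sketch below copies that matrix by hand into a NumPy array (the variable names are illustrative) and recomputes recall, precision, and accuracy. Recall divides each diagonal entry by its row sum, and precision divides it by its column sum.

```python
import numpy as np

# Test-set confusion matrix from above: rows = true rank, columns = predicted rank.
cm_test = np.array([
    [0, 11,  0, 0,  0],
    [0, 18, 10, 0,  1],
    [0, 12, 30, 3,  0],
    [0,  1,  5, 9,  4],
    [0,  0,  1, 3, 12],
])

true_pos = np.diag(cm_test)        # correctly classified count per rank
support = cm_test.sum(axis=1)      # true observations per rank (row sums)
predicted = cm_test.sum(axis=0)    # predictions made per rank (column sums)

recall = true_pos / support
# Rank 1 is never predicted, so its column sum is 0; use 0.0 there
# instead of dividing by zero.
precision = np.divide(
    true_pos, predicted,
    out=np.zeros(len(true_pos)), where=predicted > 0,
)
accuracy = true_pos.sum() / cm_test.sum()

print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
print("accuracy: ", accuracy)
```

The rounded values agree with the per-class rows of the report, and the accuracy of 0.575 matches the test score printed earlier.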





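Because there are fewer observations of 'SchoolRank' == 1 than of any other rank (11 in the test set), the model is least able to predict that class: it never predicts rank 1 on the test set, assigning all 11 of those observations to rank 2, so precision and recall for rank 1 are both 0.00. Ranks 3 and 5 are predicted best (f1-scores of 0.66 and 0.73). In both matrices, most errors land in a neighboring rank, which suggests the model captures the ordering of the ranks but struggles to separate adjacent ones.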
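
One way to quantify how far off the misclassifications are is to weight each cell of the test confusion matrix by the distance between the true and predicted rank. The sketch below does this with the matrix printed earlier, copied by hand. The variable names are illustrative, and treating SchoolRank as an ordered scale is an assumption of this check.

```python
import numpy as np

# Test-set confusion matrix from above: rows = true rank, columns = predicted rank.
cm_test = np.array([
    [0, 11,  0, 0,  0],
    [0, 18, 10, 0,  1],
    [0, 12, 30, 3,  0],
    [0,  1,  5, 9,  4],
    [0,  0,  1, 3, 12],
])

ranks = np.arange(1, 6)
# distance[i, j] = |true rank - predicted rank| for each cell of the matrix
distance = np.abs(ranks[:, None] - ranks[None, :])

errors = cm_test[distance > 0].sum()      # all misclassified observations
adjacent = cm_test[distance == 1].sum()   # misclassified by exactly one rank

print("misclassified:", errors)
print("off by one rank:", adjacent)
print("share off by one:", round(adjacent / errors, 3))
```

Here 48 of the 51 test errors are off by exactly one rank, which is consistent with the model confusing neighboring ranks rather than guessing at random.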