# Linear SVC Assignment

In [1]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

### Import the admissions data set (admissions.csv).

In [2]:
data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/admissions.csv')
data.head()

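| Unnamed: 0 | GRE | TOEFL | SchoolRank | SOP | LOR | GPA | Research | Admitted |
|---|---|---|---|---|---|---|---|---|
| 0 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 1 |
| 1 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 1 |
| 2 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 1 |
| 3 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 1 |
| 4 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0 |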


### Split the data into training and test sets, with the test set comprising 30% of the data.  Use `'Admitted'` as the target.

In [3]:
y = data.Admitted
X = data.drop(columns=['Admitted'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
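
The preview shows an `Unnamed: 0` column that appears to be a row index written out when the CSV was saved. Because `X` is built by dropping only `Admitted`, that column is passed to the model as a feature. A minimal sketch of one way to exclude it, assuming the column really is just a row counter, is to read it as the index with `index_col=0`. The inline CSV copies the five preview rows so the snippet runs without a download; `preview_csv`, `preview`, `X_preview`, and `y_preview` are illustrative names, not part of the assignment.

```python
import io

import pandas as pd

# The five rows shown by data.head(), with the unnamed index column first.
preview_csv = """,GRE,TOEFL,SchoolRank,SOP,LOR,GPA,Research,Admitted
0,337,118,4,4.5,4.5,9.65,1,1
1,324,107,4,4.0,4.5,8.87,1,1
2,316,104,3,3.0,3.5,8.0,1,1
3,322,110,3,3.5,2.5,8.67,1,1
4,314,103,2,2.0,3.0,8.21,0,0
"""

# index_col=0 turns the unnamed first column into the DataFrame index
# (assumption: it carries no information beyond the row number).
preview = pd.read_csv(io.StringIO(preview_csv), index_col=0)

y_preview = preview["Admitted"]
X_preview = preview.drop(columns=["Admitted"])

# Only the seven real predictors remain as features.
print(list(X_preview.columns))
```

The same `index_col=0` argument could be passed to the `pd.read_csv` call on the full data set. The scores reported in this notebook were produced with the column left in.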

### Generate an SVC model with a linear kernel. Set the regularization parameter (C) = 10. Check the score for both train and test sets. 

In [4]:
# Linear-kernel SVC; in scikit-learn a larger C means weaker regularization.
svm = SVC(C=10, kernel='linear')
svm.fit(X_train, y_train)

print("Training Set Score: ", svm.score(X_train, y_train))
print("Testing Set Score: ", svm.score(X_test, y_test))

Training Set Score:  0.8785714285714286
Testing Set Score:  0.8666666666666667


### Choose some other values for C and show the difference between the scores for the train and test sets.

In [5]:
# Same model with a much larger C (weaker regularization) for comparison.
svm = SVC(C=1000, kernel='linear')
svm.fit(X_train, y_train)

print("Training Set Score: ", svm.score(X_train, y_train))
print("Testing Set Score: ", svm.score(X_test, y_test))

Training Set Score:  0.8321428571428572
Testing Set Score:  0.8583333333333333
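
The prompt asks for several values of C, and a loop makes the train/test comparison easier to read than refitting by hand. The sketch below is illustrative only: it uses a synthetic data set from `make_classification` as a stand-in for the admissions data (400 rows and 7 features are assumptions chosen to resemble it), and the names `X_demo`, `y_demo`, and `results` are hypothetical. The scores it prints are not the assignment's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the admissions data (assumption: 400 rows, 7 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

results = []
for C in [0.01, 0.1, 1, 10, 100]:
    # Refit a linear-kernel SVC for each value of C.
    model = SVC(C=C, kernel='linear').fit(X_tr, y_tr)
    train_score = model.score(X_tr, y_tr)
    test_score = model.score(X_te, y_te)
    results.append((C, train_score, test_score))
    # The gap between train and test accuracy is the quantity of interest.
    print(f"C={C:<6} train={train_score:.3f} test={test_score:.3f} "
          f"gap={train_score - test_score:+.3f}")
```

To run the sweep on the real data, replace `X_tr`, `X_te`, `y_tr`, and `y_te` with the `X_train`, `X_test`, `y_train`, and `y_test` produced by the split above.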


### What if we switched up the target variable? Let's assume that we know whether a student was admitted, and try to predict what their SchoolRank was.

Create an SVC model with a linear kernel with the SchoolRank field as the target variable. Report both the train and the test scores.

In [6]:
y = data.SchoolRank
X = data.drop(columns=['SchoolRank'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

svm2 = SVC(C=10, kernel='linear')
svm2.fit(X_train, y_train)

print("Training Set Score: ", svm2.score(X_train, y_train))
print("Testing Set Score: ", svm2.score(X_test, y_test))

Training Set Score:  0.6357142857142857
Testing Set Score:  0.55
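
The split above is unseeded and unstratified, so the scores change from run to run and the rarer ranks can end up unevenly represented in the test set. A minimal sketch of a seeded, stratified split follows. The per-rank counts are recovered by adding the row sums of the train and test confusion matrices reported further down; `X_dummy` is a placeholder feature matrix, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Per-rank counts (ranks 1-5) from the confusion-matrix row sums:
# 18+8, 74+33, 100+33, 49+25, 39+21 = 400 rows in total.
rank_counts = {1: 26, 2: 107, 3: 133, 4: 74, 5: 60}
y_ranks = np.repeat(list(rank_counts.keys()), list(rank_counts.values()))
X_dummy = np.arange(len(y_ranks)).reshape(-1, 1)  # placeholder features

# stratify keeps each rank's share roughly equal in train and test;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_ranks, test_size=0.30, stratify=y_ranks, random_state=42
)

for rank in rank_counts:
    overall = np.mean(y_ranks == rank)
    in_test = np.mean(y_te == rank)
    print(f"rank {rank}: overall share {overall:.3f}, test share {in_test:.3f}")
```

On the real data the equivalent call would pass `stratify=y` to `train_test_split`. Stratifying does not add observations for the rare ranks; it only keeps their proportions consistent between the two sets.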


### Show confusion matrices for the training and test sets, and a classification report for the test set. What trends do you notice?

In [7]:
# Confusion matrix for the training set (rows: true rank, columns: predicted rank).
y_pred_train = svm2.predict(X_train)
confusion = confusion_matrix(y_train, y_pred_train)
print(confusion)

[[11  5  2  0  0]
 [ 8 43 22  1  0]
 [ 0 14 81  4  1]
 [ 0  3 10 17 19]
 [ 0  0  3 10 26]]


In [8]:
# Confusion matrix for the test set (rows: true rank, columns: predicted rank).
y_pred_test = svm2.predict(X_test)
confusion = confusion_matrix(y_test, y_pred_test)
print(confusion)

[[ 2  5  1  0  0]
 [ 2 17 13  0  1]
 [ 0  4 26  2  1]
 [ 0  1  9 12  3]
 [ 0  0  4  8  9]]


In [9]:
report = classification_report(y_test, y_pred_test)
print(report)

              precision    recall  f1-score   support

           1       0.50      0.25      0.33         8
           2       0.63      0.52      0.57        33
           3       0.49      0.79      0.60        33
           4       0.55      0.48      0.51        25
           5       0.64      0.43      0.51        21

    accuracy                           0.55       120
   macro avg       0.56      0.49      0.51       120
weighted avg       0.57      0.55      0.54       120



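Classes with fewer observations are predicted less reliably. Rank 1 has only 8 test observations and the lowest recall (0.25), while rank 3, the most common rank in the training set, has the highest recall (0.79) but the lowest precision (0.49), which means the model over-predicts it. In both confusion matrices most errors sit next to the diagonal, so a misclassified observation is usually off by a single rank.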
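
The figures in the classification report can be re-derived from the test confusion matrix, which is a useful check on how to read it. The sketch below hard-codes the matrix printed above (rows are true ranks, columns are predicted ranks) and computes recall, precision, accuracy, and the share of errors that miss by exactly one rank. Variable names such as `cm` and `adjacent_errors` are illustrative.

```python
import numpy as np

# Test-set confusion matrix copied from the output above.
cm = np.array([
    [2,  5,  1,  0, 0],
    [2, 17, 13,  0, 1],
    [0,  4, 26,  2, 1],
    [0,  1,  9, 12, 3],
    [0,  0,  4,  8, 9],
])

correct = np.diag(cm)
support = cm.sum(axis=1)              # true observations per rank
recall = correct / support            # share of each true rank found
precision = correct / cm.sum(axis=0)  # share of each predicted rank that is right
accuracy = correct.sum() / cm.sum()

# Errors where the predicted rank is exactly one away from the true rank.
rows, cols = np.indices(cm.shape)
adjacent_errors = cm[np.abs(rows - cols) == 1].sum()
total_errors = cm.sum() - correct.sum()

print("support:  ", support)
print("recall:   ", np.round(recall, 2))
print("precision:", np.round(precision, 2))
print("accuracy: ", round(accuracy, 2))
print(f"errors off by one rank: {adjacent_errors} of {total_errors}")
```

The rounded recall and precision values match the report, and 46 of the 54 test errors land in a neighbouring rank.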