Authors:
- Tina Jin
- Virginia Weston
- Jeffrey Bradley
- Taylor Tucker

## SVM model notebook

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.model_selection import validation_curve
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_curve, auc
from scipy import interpolate
from sklearn.utils import resample


In [2]:
df = pd.read_csv('cleaned_data_discreet.csv')
df.drop(columns = 'Unnamed: 0', inplace = True)
df

Unnamed: 0,Number of Bachelor's Degrees,Percent Financial Aid,Average Amount of Aid,Retention Rate,Enrollment,Percent Women,Percent In State,Percent Out of State,Percent Foreign,Percent Unknown,Graduation Rate,Percent Awarded,Total Staff,Instructional Staff,SA Staff,Librarian Staff,Percent Books,Percent Digital,Percent Admitted,Total Price
0,208.0,100.0,32400.0,79.0,996,99.0,59.0,36.0,4.0,0.0,69.0,66.0,357.0,105.0,56.0,62.0,41,12,70.0,"50,000-60,000"
1,310.0,100.0,40855.0,75.0,1533,54.0,66.0,32.0,1.0,0.0,64.0,61.0,435.0,132.0,21.0,27.0,37,54,68.0,"50,000-60,000"
2,398.0,100.0,39796.0,68.0,1912,60.0,53.0,46.0,1.0,0.0,51.0,48.0,355.0,123.0,17.0,21.0,28,13,62.0,"60,000-70,000"
3,382.0,100.0,38689.0,82.0,1771,56.0,50.0,45.0,4.0,0.0,74.0,70.0,426.0,160.0,41.0,50.0,27,46,64.0,"60,000-70,000"
4,61.0,97.0,10055.0,37.0,698,45.0,64.0,34.0,0.0,2.0,31.0,10.0,115.0,41.0,4.0,7.0,20,76,64.0,"20,000-30,000"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
220,71.0,100.0,9682.0,52.0,479,43.0,67.0,28.0,6.0,0.0,38.0,30.0,117.0,33.0,8.0,11.0,39,61,56.0,"20,000-30,000"
221,511.0,63.0,55897.0,99.0,2095,48.0,12.0,78.0,9.0,0.0,95.0,86.0,1133.0,342.0,44.0,83.0,28,56,13.0,"70,000-80,000"
222,363.0,100.0,29504.0,70.0,1757,55.0,79.0,21.0,1.0,0.0,62.0,61.0,373.0,142.0,13.0,17.0,49,22,74.0,"50,000-60,000"
223,379.0,98.0,31824.0,88.0,1666,53.0,55.0,42.0,3.0,0.0,81.0,77.0,440.0,152.0,23.0,34.0,14,60,64.0,"60,000-70,000"


In [3]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, test_size=0.3)

In [5]:
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))

In [6]:
param_range = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0]

param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

In [7]:
gs = GridSearchCV(estimator=SVC(),
                  param_grid={'C': [0.1, 0.5, 1.0, 5.0, 10.0, 50.0], 'kernel': ('linear', 'rbf', 'poly','sigmoid'), 'gamma' :[0.1, 0.5, 1.0, 5.0, 10.0, 50.0] },
                  scoring=('accuracy'),
                  refit="f1",
                  cv=10,
                  n_jobs=-1, verbose=True)

In [8]:
SVC().get_params().keys()

dict_keys(['C', 'break_ties', 'cache_size', 'class_weight', 'coef0', 'decision_function_shape', 'degree', 'gamma', 'kernel', 'max_iter', 'probability', 'random_state', 'shrinking', 'tol', 'verbose'])

In [9]:
x_train_SD = StandardScaler().fit_transform(x_train)
x_test_SD = StandardScaler().fit_transform(x_test)

In [10]:
gs = gs.fit(x_train_SD, y_train)
print("Mean:")
print("Best Train Score (Accuracy, f1, precision, recall, and ROC_auc):", gs.best_score_)
print("Best Test Score:", gs.score(x_test_SD, y_test))
print("Best Params:", gs.best_params_)

Fitting 10 folds for each of 144 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  62 tasks      | elapsed:    8.9s


Mean:
Best Train Score (Accuracy, f1, precision, recall, and ROC_auc): 0.5808333333333333
Best Test Score: 0.6323529411764706
Best Params: {'C': 1.0, 'gamma': 0.1, 'kernel': 'rbf'}


[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:   17.8s finished


Poor results, lets try using less features

In [11]:
x = df[["Average Amount of Aid", "Graduation Rate"]]
y = df.iloc[:, -1]

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, test_size=0.3)

In [13]:
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))

In [14]:
param_range = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0]

param_grid = [{'svc__C': param_range,
               'svc__kernel': ['linear']},
              {'svc__C': param_range,
               'svc__gamma': param_range,
               'svc__kernel': ['rbf']}]

In [15]:
gs = GridSearchCV(estimator=SVC(),
                  param_grid={'C': [0.1, 0.5, 1.0, 5.0, 10.0, 50.0], 'kernel': ('linear', 'rbf', 'poly','sigmoid'), 'gamma' :[0.1, 0.5, 1.0, 5.0, 10.0, 50.0] },
                  scoring=('accuracy'),
                  refit="f1",
                  cv=10,
                  n_jobs=-1, verbose=True)

In [16]:
SVC().get_params().keys()

dict_keys(['C', 'break_ties', 'cache_size', 'class_weight', 'coef0', 'decision_function_shape', 'degree', 'gamma', 'kernel', 'max_iter', 'probability', 'random_state', 'shrinking', 'tol', 'verbose'])

In [17]:
x_train_SD = StandardScaler().fit_transform(x_train)
x_test_SD = StandardScaler().fit_transform(x_test)

In [18]:
gs = gs.fit(x_train_SD, y_train)
print("Mean:")
print("Best Train Score (Accuracy, f1, precision, recall, and ROC_auc):", gs.best_score_)
print("Best Test Score:", gs.score(x_test_SD, y_test))
print("Best Params:", gs.best_params_)

Fitting 10 folds for each of 144 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 328 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 1222 tasks      | elapsed: 13.2min


Mean:
Best Train Score (Accuracy, f1, precision, recall, and ROC_auc): 0.5870833333333333
Best Test Score: 0.4411764705882353
Best Params: {'C': 1.0, 'gamma': 0.1, 'kernel': 'linear'}


[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 55.6min finished
