In this project we will train support vector machines (SVMs) with different kernels on the "Breast Cancer Coimbra Data Set" from the UCI Machine Learning Repository. There are nine quantitative attributes: Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, and MCP-1.
There is also a tenth, nominal attribute called Label, which indicates whether a patient is healthy (1) or has cancer (2). There are 116 instances in this dataset.
To begin, we'll import the necessary packages for the different parts of the project.

In [1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report , confusion_matrix 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

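Now we'll import the dataset using pandas' read_csv method, separate the nine features from the labels, save the features in the variable x and the labels in y, and perform two train-test splits to obtain the train, validation, and test sets needed for the project. The target proportions are train size = 0.7, validation size = 0.15, and test size = 0.15; because the second split takes 15% of the remaining 85%, the actual proportions are approximately 0.72, 0.13, and 0.15.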

In [2]:
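data = pd.read_csv('dataR2.csv')
# All columns except the last are features; the last column is the label.
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
# First split off the test set, then split a validation set from the remainder.
X_train1, X_test, y_train1, y_test = train_test_split(x, y, test_size=0.15, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train1, y_train1, test_size=0.15, random_state=0)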
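
As a quick sanity check on the split sizes, the sketch below applies the same two-stage split to a placeholder array of 116 row indices. The placeholder array and the names rows, rest, and split_sizes are assumptions made only to keep the example self-contained; they are not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the 116 rows of the dataset (indices only, not real data).
rows = np.arange(116)

# Same two-stage split as above: test set first, then validation set.
rest, test = train_test_split(rows, test_size=0.15, random_state=0)
train, valid = train_test_split(rest, test_size=0.15, random_state=0)

split_sizes = {'train': len(train), 'valid': len(valid), 'test': len(test)}
for name, n in split_sizes.items():
    print(f"{name}: {n} samples ({n / len(rows):.2f} of the data)")
```

With 116 rows this gives 83 train, 15 validation, and 18 test samples, which matches the 18 test labels printed in the outputs of this project.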

Now we'll use sklearn's SVC class, build an SVC object, and fit the model on our train set. We'll use three different kernels: polynomial, Gaussian, and sigmoid. For each kernel, we'll predict the labels of the test set, then calculate the confusion matrix and the evaluation metrics accuracy, precision, recall, and F1-score using sklearn's metrics module.
Finally, we'll compare these metrics across the kernels and choose the kernel with the best overall result. To keep the comparison fair, we'll set the soft margin parameter C to 1 for every kernel and the gamma parameter to 0.0001 for the Gaussian and sigmoid kernels.

<h1>Polynomial Kernel</h1>

First we'll implement an SVM with the polynomial kernel and fit the model to the train set. We'll set the degree to 8 (because the default degree in sklearn, 3, causes the code to run very slowly).

In [3]:
svclassifier = SVC(kernel='poly',degree=8,C=1)
svclassifier.fit(X_train,y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=8, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

The next step is to use the test set to make predictions.

In [4]:
y_pred = svclassifier.predict(X_test)
print('Real y values: ', y_test)
print('Predicted y values: ', y_pred)

Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


As we can see, the test set contains both labels, but the polynomial kernel has predicted every label to be 1, so all ten samples whose true label is 2 are misclassified. To get a better view of the performance, we'll calculate the confusion matrix, accuracy, precision, recall, and F1-score.

In [5]:
print('Confusion Matrix:')
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

Confusion Matrix:
[[ 8  0]
 [10  0]]
              precision    recall  f1-score   support

           1       0.44      1.00      0.62         8
           2       0.00      0.00      0.00        10

    accuracy                           0.44        18
   macro avg       0.22      0.50      0.31        18
weighted avg       0.20      0.44      0.27        18



Let's repeat the process for the Gaussian (rbf) and sigmoid kernels and compare the results.

<h1>Gaussian Kernel</h1>

Implementing an SVM with the rbf kernel and fitting it on the train set:

In [6]:
svclassifier = SVC(kernel='rbf',gamma=0.0001,C=1)
svclassifier.fit(X_train,y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Making predictions:

In [7]:
y_pred = svclassifier.predict(X_test)
print('Real y values: ', y_test)
print('Predicted y values: ', y_pred)

Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [2 2 1 2 2 2 2 2 2 2 1 1 2 1 2 1 1 1]


It seems that this kernel performed better. Let's take a closer look:

In [8]:
print('Confusion Matrix:')
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

Confusion Matrix:
[[4 4]
 [3 7]]
              precision    recall  f1-score   support

           1       0.57      0.50      0.53         8
           2       0.64      0.70      0.67        10

    accuracy                           0.61        18
   macro avg       0.60      0.60      0.60        18
weighted avg       0.61      0.61      0.61        18



<h1>Sigmoid Kernel</h1>

Training an SVM with the sigmoid kernel on the train set:

In [9]:
svclassifier = SVC(kernel='sigmoid',gamma=0.0001,C=1)
svclassifier.fit(X_train,y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Making predictions:

In [10]:
y_pred = svclassifier.predict(X_test)
print('Real y values: ', y_test)
print('Predicted y values: ', y_pred)

Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


Like the polynomial kernel, this kernel predicted a single class for every sample, in this case label 2, so all eight samples whose true label is 1 are misclassified.

In [11]:
print('Confusion Matrix:')
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

Confusion Matrix:
[[ 0  8]
 [ 0 10]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         8
           2       0.56      1.00      0.71        10

    accuracy                           0.56        18
   macro avg       0.28      0.50      0.36        18
weighted avg       0.31      0.56      0.40        18



It's time to compare the evaluation results and choose the best kernel. We'll begin with the confusion matrices, treating label 1 (healthy) as the positive class. The matrix for the polynomial kernel has 8 true positives and 10 false positives, so 8 labels were predicted correctly and 10 incorrectly. That means the kernel correctly classified 8 patients as healthy, but wrongly classified 10 patients as healthy when they had cancer.
The matrix for the sigmoid kernel has 8 false negatives and 10 true negatives, so 10 labels were predicted correctly and 8 incorrectly. That means the kernel wrongly classified 8 healthy patients as having cancer, and correctly classified 10 patients as having cancer.
So, sigmoid has performed better than polynomial.
Lastly, the rbf matrix has 4 true positives, 7 true negatives, 4 false negatives, and 3 false positives. That means 11 labels were predicted correctly and 7 incorrectly, so the Gaussian kernel predicted 1 more label correctly than the sigmoid kernel.
So, Gaussian has performed better than sigmoid.
In terms of accuracy, again the polynomial kernel has the lowest score (0.44), sigmoid is better (0.56), and Gaussian has the highest (0.61).
For precision, recall, and F1-score we will compare the macro averages.
In terms of precision, the polynomial kernel has the lowest value (0.22), sigmoid is better (0.28), and Gaussian has the highest (0.60).
In terms of recall, the polynomial and sigmoid kernels are equal (0.50), and Gaussian has the highest value (0.60).
In terms of F1-score, the polynomial kernel has the lowest score (0.31), sigmoid is better (0.36), and Gaussian has the highest (0.60).

So, across all the evaluation metrics compared, the Gaussian kernel performs best among the three kernels. Therefore we shall proceed with it through the rest of the project.
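
To make the link between the confusion matrices and the reported scores explicit, the following sketch recomputes accuracy and the macro-averaged precision, recall, and F1-score directly from the three matrices printed above. The helper macro_metrics is a hypothetical name introduced for illustration; it assumes sklearn's layout, where rows are true classes and columns are predicted classes, and it scores a class with no predicted samples as 0, as the reports above do.

```python
def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall and F1.

    cm[i][j] counts samples of true class i that were predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n_classes))
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum
        p = cm[k][k] / predicted_k if predicted_k else 0.0
        r = cm[k][k] / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        'accuracy': correct / total,
        'precision': sum(precisions) / n_classes,
        'recall': sum(recalls) / n_classes,
        'f1': sum(f1s) / n_classes,
    }

# Confusion matrices copied from the outputs above.
reported = {
    'poly': [[8, 0], [10, 0]],
    'rbf': [[4, 4], [3, 7]],
    'sigmoid': [[0, 8], [0, 10]],
}
summary = {name: macro_metrics(cm) for name, cm in reported.items()}
for name, m in summary.items():
    print(name, {key: round(value, 2) for key, value in m.items()})
```

Rounded to two decimals, these values agree with the accuracy and macro avg rows of the three classification reports.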

The next part asks us to train an SVM with the chosen kernel while varying the soft margin hyperparameter C, and to see how the evaluation metrics change each time.
First, we'll specify some C values and store them in an array:

In [12]:
C = np.array([0.0001,0.001,0.01,0.1,1, 10, 100, 1000])
print('Different C values: ',C)

Different C values:  [1.e-04 1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]


Then, for each C value, we'll fit an SVM with the Gaussian kernel on the train set, predict the labels of the test set, and calculate the confusion matrix, accuracy, precision, recall, and F1-score. In the end, we'll compare the metrics for the different C values:

In [13]:
for c in C:
    svclassifier = SVC(kernel='rbf',gamma=0.0001,C=c)
    svclassifier.fit(X_train,y_train)
    y_pred = svclassifier.predict(X_test)
    print("C= ",c) 
    print('Real y values: ',y_test)
    print('Predicted y values: ',y_pred)
    print('Confusion Matrix: ')
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test,y_pred))

C=  0.0001
Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Confusion Matrix: 
[[ 0  8]
 [ 0 10]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         8
           2       0.56      1.00      0.71        10

    accuracy                           0.56        18
   macro avg       0.28      0.50      0.36        18
weighted avg       0.31      0.56      0.40        18

C=  0.001
Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Confusion Matrix: 
[[ 0  8]
 [ 0 10]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         8
           2       0.56      1.00      0.71        10

    accuracy                           0.56        18
   macro avg       0.28      0.50      0.36        18
weighted avg       0.31      0.56      0.40        18

C=  0.01
Real y values:

The results show that increasing the soft margin parameter C from 0.0001 up to 100 improves the accuracy, precision, recall, and F1 scores and reduces the number of misclassified labels in the confusion matrix. However, increasing C from 100 to 1000 lowers the scores and increases the number of misclassified labels, which suggests that the model may be overfitting. Overall, increasing C improves the model's performance up to C=100, which is visible in all the evaluation metrics.
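
Since a validation set was created at the start of the project, one option is to compare C values on it and keep the test set for a final check. The sketch below illustrates the idea on synthetic data from make_classification, which is an assumption standing in for the real dataset, and collects one macro F1 score per C value instead of printing full reports. The names X_demo, valid_scores, and best_C are hypothetical, and gamma is left at sklearn's 'scale' setting because the synthetic features are on a different scale from the real ones.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=116, n_features=9, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)

C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
valid_scores = {}
for c in C_values:
    model = SVC(kernel='rbf', gamma='scale', C=c).fit(X_tr, y_tr)
    # Score on the validation set; the test set stays untouched here.
    valid_scores[c] = f1_score(y_va, model.predict(X_va),
                               average='macro', zero_division=0)

# Pick the C value with the highest validation score (first one on ties).
best_C = max(valid_scores, key=valid_scores.get)
for c, score in valid_scores.items():
    print(f"C={c:<8} validation macro F1={score:.2f}")
print("Best C on the validation set:", best_C)
```

The scores on synthetic data say nothing about the real dataset; the point is only the pattern of selecting C on X_va and reporting the final result once on X_te.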

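The next part asks us to change the train set size each time and see how the evaluation metrics vary.
To do that, we'll store different test set sizes in an array and use a loop to perform the train/validation/test split with a different size each time. The same fraction t is used for both the test split and the validation split, and the train size is reported as 1-2t (nominal train sizes of 0.1 to 0.9). Note that because the validation split is taken from the data that remains after the test split, the actual train fraction is (1-t)^2, for example about 0.30 rather than 0.1 for t=0.45. In each iteration of the loop, we'll fit an SVM with the Gaussian kernel on the train set, predict the labels of the test set, and calculate the evaluation metrics. We'll then compare the metrics obtained from the different iterations.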

In [14]:
test_size = np.array([0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05])
for t in test_size:
    X_train1, X_test, y_train1, y_test = train_test_split(x, y, test_size= t, random_state=0)
    X_train,X_valid,y_train,y_valid = train_test_split(X_train1,y_train1,test_size= t,random_state=0 )
    svclassifier = SVC(kernel='rbf',gamma=0.0001,C=100)
    svclassifier.fit(X_train,y_train)
    y_pred = svclassifier.predict(X_test)
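    # Nominal train size (1 - 2t); the actual fraction is (1 - t)**2 because
    # the validation split is taken from what remains after the test split.
    print("train size= ",round(1-(2*t),2))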
    print('Real y values: ',y_test)
    print('Predicted y values: ',y_pred)
    print('Confusion Matrix: ')
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test,y_pred))

train size=  0.1
Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 1 1 1 2 2
 2 2 2 2 2 1 1 1 2 2 2 2 1 1 2 1]
Predicted y values:  [1 1 2 2 2 2 1 2 1 1 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 1 2 2 2
 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1]
Confusion Matrix: 
[[ 8 15]
 [20 10]]
              precision    recall  f1-score   support

           1       0.29      0.35      0.31        23
           2       0.40      0.33      0.36        30

    accuracy                           0.34        53
   macro avg       0.34      0.34      0.34        53
weighted avg       0.35      0.34      0.34        53

train size=  0.2
Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 1 1 1 2 2
 2 2 2 2 2 1 1 1 2 2]
Predicted y values:  [2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
 1 2 2 2 2 2 2 2 1 2]
Confusion Matrix: 
[[ 2 18]
 [ 3 24]]
              precision    recall  f1-score   support

           1       0.4

The results show that increasing the size of the train set from 0.1 up to 0.7 improves the accuracy, precision, recall, and F1 scores and reduces the number of misclassified labels in the confusion matrix. However, increasing the size from 0.7 to 0.9 lowers the scores and increases the number of misclassified labels. This may indicate overfitting, although at train sizes of 0.8 and 0.9 the test set contains only 12 and 6 samples respectively, so the scores there are noisy. Overall, increasing the train set size improves the model's performance up to a train size of 0.7, which is visible in all the evaluation metrics.
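
The train sizes printed in the loop are nominal values computed as 1 - 2t. Because the second split removes a fraction t of what remains after the first split, the fraction of the data actually used for training is (1 - t)^2. The short sketch below tabulates both for the test sizes used above; the helper train_fractions is a hypothetical name, and the calculation ignores the rounding to whole samples that train_test_split performs.

```python
test_sizes = [0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]

def train_fractions(t):
    """Nominal and actual train fractions when t is used for both splits."""
    nominal = 1 - 2 * t      # value printed in the loop above
    actual = (1 - t) ** 2    # second split removes t of the remaining (1 - t)
    return nominal, actual

fractions = {t: train_fractions(t) for t in test_sizes}
for t, (nominal, actual) in fractions.items():
    print(f"t={t:.2f}  nominal train size={nominal:.2f}  actual={actual:.2f}")
```

The gap is largest for the big test sizes (a nominal 0.1 corresponds to roughly 0.30 of the data) and almost vanishes for the smallest one.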

In the last part, we are asked to perform a grid search on the SVM with the Gaussian kernel in order to tune its hyperparameters and report the results.
We'll use GridSearchCV from sklearn's model_selection module to perform the task.
First, we specify candidate values for the hyperparameters and store them in the param_grid dictionary. GridSearchCV tries every combination of these values, fits an SVM on the train set for each one, scores it by cross-validated accuracy, and in the end reports the best hyperparameter values for our model.

To see the effect of the grid search more clearly, we'll first fit an SVM on the train set without hyperparameter tuning, predict the labels of the test set, and calculate the evaluation metrics. We'll then repeat the process using GridSearchCV and compare the results.

In [15]:
X_train1, X_test, y_train1, y_test = train_test_split(x, y, test_size=0.15, random_state=0)
X_train,X_valid,y_train,y_valid = train_test_split(X_train1,y_train1,test_size=0.15,random_state=0 )
svclassifier = SVC(kernel='rbf')
svclassifier.fit(X_train,y_train)
y_pred = svclassifier.predict(X_test)
print('Real y values: ',y_test)
print('Predicted y values: ',y_pred)
print("Confusion Matrix: ")
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Confusion Matrix: 
[[ 0  8]
 [ 0 10]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         8
           2       0.56      1.00      0.71        10

    accuracy                           0.56        18
   macro avg       0.28      0.50      0.36        18
weighted avg       0.31      0.56      0.40        18



We can see that the confusion matrix has 10 correctly predicted labels and 8 incorrectly predicted labels. The accuracy is 0.56, and the macro-average precision, recall, and F1 score are 0.28, 0.50, and 0.36.
Let's see if we can improve the scores using grid search.

First we'll specify some C and gamma values (the two hyperparameters we tune for the rbf kernel) and set the kernel type to rbf:

In [16]:
param_grid = {'C': [0.1, 1.0, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}

Now we'll call GridSearchCV, pass it the param_grid dictionary, set the number of cross-validation folds to 6, fit the grid search to the train set, and finally report the best hyperparameter values for our rbf model:

In [17]:
# refit=True retrains the best configuration on the whole train set after the search.
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=6)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_estimator_)

{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
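
Beyond the single best setting, GridSearchCV keeps the cross-validated score of every candidate in its cv_results_ attribute, which helps judge how close the alternatives are. The sketch below shows how this could be inspected; it runs on synthetic data from make_classification, which is an assumption standing in for the real train set, and the names demo_grid, search, and n_candidates are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the train set (assumption).
X_demo, y_demo = make_classification(n_samples=98, n_features=9, random_state=0)

demo_grid = {'C': [0.1, 1.0, 10, 100, 1000],
             'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             'kernel': ['rbf']}
search = GridSearchCV(SVC(), demo_grid, refit=True, cv=6)
search.fit(X_demo, y_demo)

results = search.cv_results_
n_candidates = len(results['params'])  # 5 C values x 5 gamma values

# Show the five candidates with the best mean cross-validated accuracy.
top = np.argsort(results['rank_test_score'])[:5]
for i in top:
    print(results['params'][i], f"{results['mean_test_score'][i]:.3f}")
print("Candidates evaluated:", n_candidates)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

With 25 candidates and 6 folds, the search fits 150 models, plus one final refit of the best candidate on the full train set.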


We can observe that the best C value found by the grid search is 100 and the best gamma value is 0.0001.
To test that and see whether the model's accuracy improves, we'll use the tuned model (refit on the train set with the best hyperparameters found) to predict the labels of the test set, then calculate the evaluation metrics and see whether the results have improved.

In [18]:
y_grid = grid.predict(X_test)
print('Real y values: ',y_test)
print('Predicted y values: ',y_grid)
print("Confusion Matrix: ")
print(confusion_matrix(y_test,y_grid))
print(classification_report(y_test,y_grid))

Real y values:  [1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2]
Predicted y values:  [1 2 2 1 1 1 2 2 2 2 1 2 2 1 1 2 1 1]
Confusion Matrix: 
[[7 1]
 [2 8]]
              precision    recall  f1-score   support

           1       0.78      0.88      0.82         8
           2       0.89      0.80      0.84        10

    accuracy                           0.83        18
   macro avg       0.83      0.84      0.83        18
weighted avg       0.84      0.83      0.83        18



From the results we can see that the confusion matrix now has 15 correct predictions and only 3 incorrectly predicted labels. We have achieved an accuracy of 0.83, with a macro-average precision of 0.83, recall of 0.84, and F1 score of 0.83. The evaluation results therefore indicate that the hyperparameter values selected by the grid search for the Gaussian kernel (C=100 and gamma=0.0001), the best among the candidates in the grid, improve the model's performance over the default settings.