UCI ML Breast Cancer Wisconsin (Diagnostic) dataset from scikit-learn: EDA and Predictive Modelling (SVM)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
from sklearn import datasets
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.svm import SVC
from sklearn import metrics

Importing the dataset and setting up features and labels

In [10]:
# Load the dataset bundled with scikit-learn; the target is 0 = malignant, 1 = benign
BCskl = datasets.load_breast_cancer()
dataset_features = pd.DataFrame(BCskl['data'], columns=BCskl['feature_names'])
dataset_label = pd.Series(BCskl['target'], name='Cancer Benign?')
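Before modelling, it can help to check how the two classes are distributed. The following is a minimal sketch and not part of the original notebook. It reloads the data so that it runs on its own, and the names `bc`, `labels`, and `counts` are illustrative.

```python
import pandas as pd
from sklearn import datasets

# Reload the data so this sketch runs on its own
bc = datasets.load_breast_cancer()

# Map the integer targets to their class names (0 = malignant, 1 = benign)
labels = pd.Series(bc['target']).map(dict(enumerate(bc['target_names'])))

# Count samples per class to see how balanced the dataset is
counts = labels.value_counts()
print(counts)
```

The benign class is the majority, which is worth keeping in mind when reading accuracy figures: a model that always predicts benign can still score well above 50%.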

Splitting the dataset into train and test sets

In [11]:
ftrain,ftest,ltrain,ltest=train_test_split(dataset_features,dataset_label,test_size=0.33,random_state=111)
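As a quick sanity check on the split, the sketch below repeats the same `test_size` and `random_state` and prints the resulting sizes. It is an illustration only; the names `X`, `y`, `X_train`, and so on are assumptions chosen for the example and are not used elsewhere in the notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load features and labels as plain arrays
X, y = datasets.load_breast_cancer(return_X_y=True)

# Same split settings as the notebook cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=111)

# Number of samples in each part
print(len(X_train), len(X_test))
```

The notebook does not stratify the split. Passing `stratify=y` to `train_test_split` is one option if the class proportions should be kept the same in both parts.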

Building a predictive model using SVM (no parameter tuning)

In [12]:
# Fit an SVC with its default hyperparameters and predict on the test set
mySVC = SVC()
mySVC.fit(ftrain, ltrain)
predictions = mySVC.predict(ftest)



Evaluating performance

In [14]:
print(metrics.classification_report(ltest,predictions))
print('\n')
print(metrics.confusion_matrix(ltest,predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        75
           1       0.60      1.00      0.75       113

    accuracy                           0.60       188
   macro avg       0.30      0.50      0.38       188
weighted avg       0.36      0.60      0.45       188



[[  0  75]
 [  0 113]]




**The model performs poorly at identifying malignant cancers: none of the 75 malignant cases (class 0) is predicted, and every test sample is labelled benign.**

So we tune the model hyperparameters C and gamma with a grid search.
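One way to see why gamma matters so much here: the RBF kernel is exp(-gamma * ||x - x'||^2), and the features are unscaled, so squared distances between samples are large. The estimator printed later in this notebook shows `gamma='auto_deprecated'`, which, as far as the scikit-learn documentation of that era describes it, meant 1 / n_features. The sketch below is not from the original notebook. It compares the average kernel value between distinct samples for that gamma and for a much smaller one, and `mean_offdiag_kernel` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics.pairwise import rbf_kernel

X, _ = datasets.load_breast_cancer(return_X_y=True)
n_samples, n_features = X.shape

def mean_offdiag_kernel(X, gamma):
    """Average RBF similarity between distinct samples (hypothetical helper)."""
    K = rbf_kernel(X, gamma=gamma)
    off_diag = ~np.eye(len(X), dtype=bool)  # mask out each sample's similarity to itself
    return K[off_diag].mean()

# Assumption: the old 'auto' behaviour corresponds to gamma = 1 / n_features
auto_gamma = 1.0 / n_features
mean_auto = mean_offdiag_kernel(X, auto_gamma)
mean_small = mean_offdiag_kernel(X, 0.0001)

print('gamma = 1/n_features:', mean_auto)
print('gamma = 0.0001      :', mean_small)
```

When the average similarity between different samples is close to zero, each training point looks unlike every other point, and the classifier has little to generalise from. A smaller gamma widens the kernel so that nearby samples register as similar.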

In [18]:
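# Candidate values for C and gamma; every combination is tried with an RBF kernel
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}
myGS_SVC = SVC()
# refit=True retrains the best combination on the full training set
grid = GridSearchCV(myGS_SVC, param_grid, refit=True, verbose=5)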

In [21]:
grid.fit(ftrain,ltrain)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s


Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.641, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.638, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.643, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.641, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.638, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.643, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0

[CV] ..... C=100, gamma=0.0001, kernel=rbf, score=0.929, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ......... C=1000, gamma=1, kernel=rbf, score=0.641, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ......... C=1000, gamma=1, kernel=rbf, score=0.638, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV] ......... C=1000, gamma=1, kernel=rbf, score=0.643, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] ....... C=1000, gamma=0.1, kernel=rbf, score=0.641, total=   0.1s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] ....... C=1000, gamma=0.1, kernel=rbf, score=0.638, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV] ....... C=1000, gamma=0.1, kernel=rbf, score=0.643, total=   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV] .

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    1.8s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=5)

Getting the best parameters and estimator

In [22]:
grid.best_params_

{'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
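Besides `best_params_`, the fitted search object keeps the scores of every candidate in `cv_results_`, which is easier to read as a table than the verbose log. The sketch below is a self-contained illustration and not part of the original notebook. It rebuilds the same grid, sets `cv=3` explicitly to match the "3 folds" reported in the log above (newer scikit-learn releases use a different default), and the names `search`, `results`, and `top` are illustrative. Scores may differ slightly from the log depending on the library version.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=111)

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# cv=3 is an explicit choice here to mirror the 3-fold search in the log
search = GridSearchCV(SVC(), param_grid, cv=3, refit=True)
search.fit(X_train, y_train)

# One row per candidate; rank 1 is the best mean cross-validation score
results = pd.DataFrame(search.cv_results_)
cols = ['param_C', 'param_gamma', 'mean_test_score',
        'std_test_score', 'rank_test_score']
top = results[cols].sort_values('rank_test_score').head()
print(top)
```

Looking at the top few rows rather than only the winner shows whether neighbouring settings score similarly, which gives a rough sense of how sensitive the model is to C and gamma.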

In [23]:
grid.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [25]:
predictions=grid.predict(ftest)

Evaluating performance

In [28]:
print(metrics.classification_report(ltest,predictions))
print('\n')
print(metrics.confusion_matrix(ltest,predictions))

              precision    recall  f1-score   support

           0       0.93      0.89      0.91        75
           1       0.93      0.96      0.94       113

    accuracy                           0.93       188
   macro avg       0.93      0.92      0.93       188
weighted avg       0.93      0.93      0.93       188



[[ 67   8]
 [  5 108]]


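After tuning C and gamma to find the best estimator, performance is good: test accuracy rises from 0.60 to 0.93, and recall on the malignant class (0) rises from 0.00 to 0.89.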
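The figures in the classification report can be read directly off the confusion matrix. As a worked check, the sketch below copies the tuned model's matrix from the output above and recomputes recall, precision, and accuracy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the tuned model's output above:
# rows are the true class, columns the predicted class (0 = malignant, 1 = benign)
cm = np.array([[67, 8],
               [5, 108]])

correct = cm.diagonal()                # correctly classified samples per class
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

print('recall   :', recall.round(2))
print('precision:', precision.round(2))
print('accuracy :', round(accuracy, 2))
```

The 8 in the top-right cell counts the malignant cases predicted as benign. In a diagnostic setting these are usually the more costly errors, so malignant recall (67 of 75) is the figure to watch alongside overall accuracy.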