# Predicting Breast Cancer Diagnosis Using SVC

This dataset originates from the UCI Machine Learning Repository and was downloaded from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

a) **radius** (mean of distances from center to points on the perimeter)<br>
b) **texture** (standard deviation of gray-scale values)<br>
c) **perimeter**<br>
d) **area**<br>
e) **smoothness** (local variation in radius lengths)<br>
f) **compactness** (perimeter^2 / area - 1.0)<br>
g) **concavity** (severity of concave portions of the contour)<br>
h) **concave points** (number of concave portions of the contour)<br>
i) **symmetry**<br>
j) **fractal dimension** ("coastline approximation" - 1)<br>

Column names ending in "se" or "worst" refer to the standard error or the "worst" value (the mean of the three largest values) of that feature, respectively; names ending in "mean" hold the mean.

The target column is the binary "diagnosis" column.
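The naming convention above implies 30 feature columns: ten base features times three statistics. Below is a minimal sketch of generating and splitting those names, assuming the `<feature>_<suffix>` pattern seen in the Kaggle CSV (for example `radius_mean` or `concave points_se`). The helper `split_column` is hypothetical and not part of the notebook.

```python
# Base feature names as they appear in the CSV column headers.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
SUFFIXES = ["mean", "se", "worst"]

# All "mean" columns come first, then "se", then "worst".
feature_columns = [f"{base}_{suffix}" for suffix in SUFFIXES for base in BASE_FEATURES]


def split_column(name):
    """Split a column name into (base feature, statistic) at the last underscore."""
    base, _, suffix = name.rpartition("_")
    return base, suffix


print(len(feature_columns))                      # 30
print(split_column("fractal_dimension_worst"))   # ('fractal_dimension', 'worst')
```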

# Summary

#### SVC after 5-fold cross-validation
* Unscaled
    * Test accuracy: 0.9292
    * Recall: 0.9055
* Scaled
    * Test accuracy: 0.9300
    * Recall: 0.9055
* Unscaled after dropping low-importance columns
    * Test accuracy: 0.9296
    * Recall: 0.9055

In [32]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
import matplotlib.pyplot as plt
import numpy as np

from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_validate
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, recall_score, make_scorer

In [2]:
cancer = pd.read_csv('breast_cancer.csv')

cancer = cancer.drop(['Unnamed: 32', 'id'], axis = 1)

diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [3]:
# Pass the column by keyword: the positional `axis` argument was removed in pandas 2.0.
X = cancer.drop(columns = 'diagnosis')
y = cancer['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [4]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
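The cell above fits the scaler on the training split only and reuses it for the test split. The same arrangement can be expressed as a single estimator with a scikit-learn `Pipeline`. This is a minimal sketch, not the notebook's method, and it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in for `breast_cancer.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption). Its target encoding is 0 = malignant, 1 = benign,
# the reverse of diag_map in this notebook.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# fit() first fits the scaler on the data it receives, then fits the SVC
# on the scaled result.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))
pipe.fit(X_train, y_train)

# The scaler's statistics come from the training split only.
scaler_means = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_means, X_train.mean(axis=0)))  # True

# score() applies the already-fitted scaler to the test split before predicting.
test_accuracy = pipe.score(X_test, y_test)
print(round(test_accuracy, 4))
```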

# Unscaled

In [5]:
svc_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [1, 10, 20],
    'degree': [2, 3, 5]
}

model_svc_grid = GridSearchCV(SVC(), param_grid = svc_grid, verbose = 1, n_jobs = -1)
model_svc_grid.fit(X_train, y_train)

print(model_svc_grid.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.4s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  1.6min finished


{'C': 10, 'degree': 2, 'kernel': 'linear'}
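The `degree` value in the result above has no effect on a linear kernel: `SVC` only uses `degree` when `kernel='poly'`. Crossing it with every kernel therefore repeats identical fits. A possible alternative, sketched here with the hypothetical name `split_grid`, is to pass a list of grids so that `degree` varies only for the polynomial kernel.

```python
from sklearn.model_selection import ParameterGrid

# Grid as used in the notebook: `degree` is crossed with all three kernels.
full_grid = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [1, 10, 20],
    "degree": [2, 3, 5],
}

# Alternative: `degree` appears only alongside the polynomial kernel.
split_grid = [
    {"kernel": ["linear", "rbf"], "C": [1, 10, 20]},
    {"kernel": ["poly"], "C": [1, 10, 20], "degree": [2, 3, 5]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 27 15
```

`split_grid` can be passed as `param_grid` to `GridSearchCV` unchanged; with 5 folds that is 75 fits instead of 135.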


In [6]:
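# Note: a bare `%time` line times an empty statement, hence "Wall time: 0 ns" below.
# Placing `%%time` on the first line of the cell would time the whole cell.
%time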
model_svc = SVC(kernel = 'linear', C = 10)
model_svc.fit(X_train, y_train)

y_pred_svc = model_svc.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_svc),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print(model_svc.score(X_train, y_train))
print(model_svc.score(X_test, y_test))
print(classification_report(y_test, y_pred_svc))
print(confusion_df)

Wall time: 0 ns
0.978021978021978
0.956140350877193
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        72
           1       0.97      0.90      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               4              38
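The figures in the classification report can be recovered directly from the confusion matrix. A short worked example using the counts printed above, with malignant as the positive class:

```python
import numpy as np

# Counts from the output above: rows are actual, columns are predicted,
# both in the order [benign, malignant].
cm = np.array([[71, 1],
               [4, 38]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 109 / 114
recall = tp / (tp + fn)           # 38 / 42: share of malignant cases detected
precision = tp / (tp + fp)        # 38 / 39: share of malignant predictions that are correct

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
# accuracy=0.9561 recall=0.9048 precision=0.9744
```

The four false negatives (malignant cases predicted benign) are what hold recall at about 0.90.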


In [7]:
cv_scores = cross_val_score(model_svc, X_test, y_test, cv= 5)
print('Cross val test accuracy: {}'.format(cv_scores.mean()))

Cross val test accuracy: 0.9292490118577075


In [33]:
cv_scores = cross_val_score(model_svc, X_test, y_test, cv = 5, scoring = make_scorer(recall_score))
print('Mean cross val recall: {}'.format(cv_scores.mean()))

Mean cross val recall: 0.9055555555555556
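The two cells above call `cross_val_score` on `X_test`, so each fold refits the model on part of the 114 held-out rows. A different arrangement, shown here as an assumption and not as the notebook's method, cross-validates on the training split and collects accuracy and recall in one pass with `cross_validate`. The sketch uses scikit-learn's bundled copy of the dataset as a stand-in and scales inside a pipeline to keep the linear kernel fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y  # bundled data uses 0 = malignant; flip so 1 = malignant, as in diag_map

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# Scaling sits inside the pipeline, so it is refit within every fold.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=10))

# One call evaluates both metrics on the same 5 folds of the training split.
cv_results = cross_validate(
    model, X_train, y_train, cv=5, scoring=["accuracy", "recall"]
)

mean_accuracy = cv_results["test_accuracy"].mean()
mean_recall = cv_results["test_recall"].mean()
print(f"CV accuracy: {mean_accuracy:.4f}, CV recall: {mean_recall:.4f}")
```

With this layout the test split stays untouched until a final `score` call.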


In [8]:
coef = model_svc_grid.best_estimator_.coef_[0]
im_df = pd.DataFrame({"feat": X_train.columns, "coef_sq": coef **2})
im_df.sort_values("coef_sq", ascending=False)

|    | feat                 | coef_sq   |
|----|----------------------|-----------|
| 28 | symmetry_worst       | 43.154818 |
| 0  | radius_mean          | 33.737193 |
| 11 | texture_se           | 21.95754  |
| 26 | concavity_worst      | 13.790124 |
| 6  | concavity_mean       | 7.864922  |
| 27 | concave points_worst | 6.963214  |
| 24 | smoothness_worst     | 5.220811  |
| 8  | symmetry_mean        | 4.072903  |
| 7  | concave points_mean  | 3.472156  |
| 15 | compactness_se       | 2.348802  |
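Squared coefficients of a linear SVC are expressed per unit of each feature, so they are only directly comparable when the features share a scale. Below is a minimal sketch of the same ranking computed on standardized inputs. It assumes scikit-learn's bundled copy of the dataset as a stand-in for `breast_cancer.csv` and fits on all rows for illustration, so its column names and values will differ from the table above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Standardize so each coefficient is per standard deviation of its feature.
X_scaled = StandardScaler().fit_transform(X)

linear_svc = SVC(kernel="linear", C=1).fit(X_scaled, y)

# Rank features by squared coefficient, largest first.
importance = (
    pd.DataFrame({"feat": X.columns, "coef_sq": linear_svc.coef_[0] ** 2})
    .sort_values("coef_sq", ascending=False)
    .reset_index(drop=True)
)
print(importance.head(10))
```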


# Scaled

In [9]:
model_svc_grid_s = GridSearchCV(SVC(), param_grid = svc_grid, verbose = 1, n_jobs = -1)
model_svc_grid_s.fit(X_train_scaled, y_train)

print(model_svc_grid_s.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'C': 10, 'degree': 2, 'kernel': 'rbf'}


[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:    0.3s finished


In [30]:
# The grid search selected C = 10, but C = 1 gave better test accuracy, so C = 1 is used here.

model_svc_scale = SVC(kernel = 'rbf', C = 1)
model_svc_scale.fit(X_train_scaled, y_train)

y_pred_svc_s = model_svc_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_svc_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_svc_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_svc_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_svc_s))
print(confusion_df)

Training Score: 0.989010989010989
Test Score: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        72
           1       0.98      0.98      0.98        42

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               1              41


In [11]:
cv_scores = cross_val_score(model_svc_scale, X_test_scaled, y_test, cv= 5)
print('Cross val test accuracy: {}'.format(cv_scores.mean()))

Cross val test accuracy: 0.9300395256916996


In [34]:
cv_scores = cross_val_score(model_svc_scale, X_test_scaled, y_test, cv = 5, scoring = make_scorer(recall_score))
print('Mean cross val recall: {}'.format(cv_scores.mean()))

Mean cross val recall: 0.9055555555555556


# Dropping low-importance columns

Coefficient-based feature importances are only available for the model fit on unscaled data, because `coef_` exists only for a linear kernel; the model fit on scaled data uses an RBF kernel.
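For a non-linear kernel, one model-agnostic alternative is permutation importance, which measures how much the score drops when a single column is shuffled. This notebook does not use it. The sketch below assumes scikit-learn's bundled copy of the dataset as a stand-in, and the names `rbf_model` and `ranking` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer(as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=20, stratify=data.target
)

rbf_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1)).fit(X_train, y_train)

# Accessing coef_ on a non-linear SVC raises AttributeError, so hasattr is False.
has_coef = hasattr(rbf_model.named_steps["svc"], "coef_")
print(has_coef)  # False

# Shuffle each column 10 times and record the mean drop in accuracy.
result = permutation_importance(rbf_model, X_val, y_val, n_repeats=10, random_state=0)

ranking = sorted(
    zip(data.data.columns, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking[:5]:
    print(f"{name}: {score:.4f}")
```

Because many of the 30 features are strongly correlated, shuffling one column at a time can understate its importance; the ranking is a rough guide at best.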

In [21]:
cancer_dropped_u = cancer.drop(columns=['fractal_dimension_mean',
                                        'area_mean',
                                        'area_worst',
                                        'perimeter_mean',
                                        'fractal_dimension_worst',
                                        'perimeter_se',
                                        'smoothness_se',
                                        'area_se',
                                        'concave points_se',
                                        'perimeter_worst',
                                        'texture_mean'])


X = cancer_dropped_u.drop(columns = 'diagnosis')

y = cancer_dropped_u['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [22]:
svc_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [1, 10, 20],
    'degree': [2, 3, 5]
}

model_svc_grid = GridSearchCV(SVC(), param_grid = svc_grid, verbose = 1, n_jobs = -1)
model_svc_grid.fit(X_train, y_train)

print(model_svc_grid.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'C': 20, 'degree': 2, 'kernel': 'linear'}


[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:    0.5s finished


In [24]:
# Note: a bare `%time` times an empty statement; `%%time` on the first line would time the cell.
%time
model_svc_op = SVC(kernel = 'linear', C = 20)
model_svc_op.fit(X_train, y_train)

y_pred_svc = model_svc_op.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_svc),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print(model_svc_op.score(X_train, y_train))
print(model_svc_op.score(X_test, y_test))
print(classification_report(y_test, y_pred_svc))
print(confusion_df)

Wall time: 0 ns
0.9824175824175824
0.9473684210526315
              precision    recall  f1-score   support

           0       0.93      0.99      0.96        72
           1       0.97      0.88      0.93        42

    accuracy                           0.95       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               5              37


In [26]:
cv_scores = cross_val_score(model_svc_op, X_test, y_test, cv= 5)
print('Cross val test accuracy: {}'.format(cv_scores.mean()))

Cross val test accuracy: 0.9296442687747035


In [35]:
# `X_test_op` is never defined in this notebook; `X_test` already holds the reduced feature set.
cv_scores = cross_val_score(model_svc_op, X_test, y_test, cv = 5, scoring = make_scorer(recall_score))
print('Mean cross val recall: {}'.format(cv_scores.mean()))

Mean cross val recall: 0.9055555555555556
