# Predicting Breast Cancer Diagnosis Using KNeighborsClassifier

This dataset is from the UCI Machine Learning Repository and was downloaded from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

a) **radius** (mean of distances from center to points on the perimeter)<br>
b) **texture** (standard deviation of gray-scale values)<br>
c) **perimeter**<br>
d) **area**<br>
e) **smoothness** (local variation in radius lengths)<br>
f) **compactness** (perimeter^2 / area - 1.0)<br>
g) **concavity** (severity of concave portions of the contour)<br>
h) **concave points** (number of concave portions of the contour)<br>
i) **symmetry**<br>
j) **fractal dimension** ("coastline approximation" - 1)<br>
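
As a quick sanity check on the compactness formula in (f), the sketch below evaluates perimeter^2 / area - 1.0 for two idealized shapes. The helper name `compactness` and the shapes are illustrative assumptions, and the values stored in the dataset may be on a different scale than this literal formula produces. A circle gives the smallest value the formula can take, 4π - 1, and less round outlines score higher.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as written in item (f): perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Idealized circle of radius r: perimeter = 2*pi*r, area = pi*r^2.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# Idealized square of side s: perimeter = 4*s, area = s^2.
s = 3.0
square = compactness(4 * s, s ** 2)

print(f"circle: {circle:.3f}")  # equals 4*pi - 1, independent of r
print(f"square: {square:.3f}")  # equals 16 - 1, independent of s
```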

Column names ending with "se" or "worst" refer to the standard error or the "worst" value (the mean of the three largest values) of that feature, respectively.

The target column is the binary "diagnosis" column.

# Summary

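#### KNN
    * Unscaled
        Test accuracy: 0.9298
        Recall: 0.83
    * Scaled
        Test accuracy: 0.9561
        Recall: 0.93

KNN benefits from scaling: according to the confusion matrices and classification reports below, test accuracy rises from 0.93 (106 of 114 correct) to 0.96 (109 of 114 correct), and recall on the malignant class rises from 0.83 to 0.93. Even so, the scaled model still misses 3 of the 42 malignant cases in the test set, so KNN is not my first recommendation. If I had to choose a KNN model, I would use the one trained on scaled data.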
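
The accuracy and recall figures in this summary can be recomputed by hand from the two confusion matrices printed in the Unscaled and Scaled sections. The sketch below does exactly that; the helper name `accuracy_and_recall` is hypothetical, and the counts are copied from the notebook output rather than recomputed from the data.

```python
def accuracy_and_recall(tn, fp, fn, tp):
    """Return (accuracy, recall of the positive class) from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)  # share of actual positives that were caught
    return accuracy, recall

# Counts copied from the confusion matrices in this notebook
# (rows: actual 0/1, columns: predicted 0/1; 1 = malignant).
matrices = {
    "unscaled": (71, 1, 7, 35),
    "scaled": (70, 2, 3, 39),
}

for name, counts in matrices.items():
    acc, rec = accuracy_and_recall(*counts)
    print(f"{name}: accuracy={acc:.4f}, recall={rec:.2f}")
```

Recall is tracked alongside accuracy because a false negative here is a malignant tumour classified as benign.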

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
import matplotlib.pyplot as plt
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
cancer = pd.read_csv('breast_cancer.csv')

cancer = cancer.drop(['Unnamed: 32', 'id'], axis = 1)

diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [16]:
X = cancer.drop(['diagnosis'], axis = 1)
y = cancer['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [17]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
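
The cell above fits the scaler on the training split only and reuses those statistics for the test split. A minimal pure-Python sketch of that idea follows, using made-up numbers for a single feature. The values and the helper `standardize` are hypothetical; `StandardScaler` uses the population standard deviation, which `statistics.pstdev` matches.

```python
from statistics import mean, pstdev

# Made-up single-feature values, only to illustrate the mechanics.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 20.0]

# Fit step: learn the statistics from the training split only.
mu = mean(train)
sigma = pstdev(train)  # population standard deviation (ddof = 0)

def standardize(values):
    """Transform step: apply the training statistics to any split."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)

print(train_scaled)  # zero mean, unit variance by construction
print(test_scaled)   # uses the training mean and std, so not exactly standardized
```

Reusing the training statistics keeps information about the test split out of the preprocessing step.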

# Unscaled

In [18]:
knn_grid = {
    'n_neighbors': [5, 10, 50],
    'weights': ['uniform', 'distance'],
}

model_knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid = knn_grid, verbose = 1, n_jobs = -1)
model_knn_grid.fit(X_train, y_train)

print(model_knn_grid.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
{'n_neighbors': 10, 'weights': 'distance'}


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s finished
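
The "6 candidates, totalling 30 fits" line in the log follows directly from the grid: 3 values of `n_neighbors` times 2 values of `weights`, each evaluated on 5 folds. The sketch below only reproduces that count by enumerating the combinations. It is not how `GridSearchCV` is implemented, and `n_folds = 5` is an assumption taken from the log.

```python
from itertools import product

knn_grid = {
    "n_neighbors": [5, 10, 50],
    "weights": ["uniform", "distance"],
}
n_folds = 5  # assumption: matches "Fitting 5 folds" in the log above

# Cartesian product of all parameter values, one dict per candidate.
candidates = [dict(zip(knn_grid, values)) for values in product(*knn_grid.values())]

for params in candidates:
    print(params)
print(f"{len(candidates)} candidates x {n_folds} folds = {len(candidates) * n_folds} fits")
```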


In [19]:
model_knn = KNeighborsClassifier(weights = 'distance', n_neighbors = 10)
model_knn.fit(X_train, y_train)

y_pred_knn = model_knn.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn),
    index=["Actually 0", "Actually 1",],
    columns=["Predicted 0", "Predicted 1",],
)


print('Training Score: {}'.format(model_knn.score(X_train, y_train)))
print('Test Score: {}'.format(model_knn.score(X_test, y_test)))
print(classification_report(y_test, y_pred_knn))
print(confusion_df)

Training Score: 1.0
Test Score: 0.9298245614035088
              precision    recall  f1-score   support

           0       0.91      0.99      0.95        72
           1       0.97      0.83      0.90        42

    accuracy                           0.93       114
   macro avg       0.94      0.91      0.92       114
weighted avg       0.93      0.93      0.93       114

            Predicted 0  Predicted 1
Actually 0           71            1
Actually 1            7           35


# Scaled

In [20]:
model_knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid = knn_grid, verbose = 1, n_jobs = -1)
model_knn_grid.fit(X_train_scaled, y_train)

print(model_knn_grid.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
{'n_neighbors': 10, 'weights': 'distance'}


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    0.1s finished


In [26]:
model_knn_scale = KNeighborsClassifier(weights = 'distance', n_neighbors = 10)
model_knn_scale.fit(X_train_scaled, y_train)

y_pred_knn_s = model_knn_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn_s),
    index=["Actually 0", "Actually 1",],
    columns=["Predicted 0", "Predicted 1",],
)

# Score the model fitted on scaled data (model_knn_scale); model_knn was fitted on unscaled data
print('Training Score: {}'.format(model_knn_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_knn_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_knn_s))
print(confusion_df)

Training Score: 1.0
Test Score: 0.956140350877193
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        72
           1       0.95      0.93      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

            Predicted 0  Predicted 1
Actually 0           70            2
Actually 1            3           39


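The scaled model performs better on the test set: according to the confusion matrix, it classifies 109 of 114 samples correctly (accuracy 0.96, versus 0.93 unscaled) and raises recall on the malignant class from 0.83 to 0.93. This suggests that, without scaling, the features of higher magnitude dominate the distance calculation regardless of how informative they are.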
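
The interaction between feature magnitude and KNN can be seen on a toy example: with raw values, Euclidean distance is driven almost entirely by whichever feature has the largest numeric range. The three points below are made up for illustration, with loosely area-like and smoothness-like magnitudes. They are not rows of the dataset, and the helpers `nearest` and `standardize` are hypothetical.

```python
import math
from statistics import mean, pstdev

# Hypothetical (area, smoothness) pairs; not taken from the dataset.
points = {
    "query": (500.0, 0.08),
    "a": (520.0, 0.16),  # close in area, far in smoothness
    "b": (530.0, 0.08),  # farther in area, identical smoothness
}

def nearest(data):
    """Name of the point closest to 'query' by Euclidean distance."""
    others = {k: v for k, v in data.items() if k != "query"}
    return min(others, key=lambda k: math.dist(data["query"], others[k]))

def standardize(data):
    """Scale each feature to zero mean and unit (population) variance."""
    columns = list(zip(*data.values()))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return {
        k: tuple((x - m) / s for x, (m, s) in zip(v, stats))
        for k, v in data.items()
    }

print("raw:   ", nearest(points))               # area dominates the distance
print("scaled:", nearest(standardize(points)))  # smoothness now counts too
```

On the raw values the smoothness difference barely registers, so the nearest neighbour is decided by area alone. After standardization both features contribute on a comparable scale and the nearest neighbour changes.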