#  Classification Using KNN (With Scaling)

<b> [Breast Cancer Diagnostic] </b>

Our target is to train a KNN classification model that can predict whether the cancer is benign (B) or malignant (M).

Attribute Information:
<br>1) ID number 
<br>2) Diagnosis (M = malignant, B = benign) 
<br>3-32) Ten real-valued features are computed for each cell nucleus: 
<br>a) radius (mean of distances from center to points on the perimeter) 
<br>b) texture (standard deviation of gray-scale values) 
<br>c) perimeter 
<br>d) area 
<br>e) smoothness (local variation in radius lengths) 
<br>f) compactness (perimeter^2 / area - 1.0) 
<br>g) concavity (severity of concave portions of the contour) 
<br>h) concave points (number of concave portions of the contour) 
<br>i) symmetry 
<br>j) fractal dimension ("coastline approximation" - 1)

The **`'diagnosis'`** column is the **Dependent Variable or target column** because we want our algorithm to predict this class.

Columns **`3-32`** are the **Features or Independent Variables** that help predict the Benign/Malignant class. The ID number in column 1 is only an identifier and carries no predictive information, so it is not used as a feature. This notebook keeps only the ten `_mean` features.

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Loading the dataset into a pandas DataFrame.
#df = pd.read_csv('../Data/Breast_Cancer_Diagnostic.csv')

# the code is changed to read data from a github url
url='https://raw.githubusercontent.com/sujitcl/code/main/Data/Breast_Cancer_Diagnostic.csv'

df=pd.read_csv(url)

In [None]:
# Retain the 10 features and the target variable.
df = df[['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean','diagnosis']]

In [None]:
# Check for nulls.
df.columns[df.isnull().any()]

Index([], dtype='object')

## Create the DataFrame of features (X) and the target variable (y)

In [None]:
# Load the features to a variable X
# X is created by simply dropping the diagnosis column and retaining all others
X = df.drop('diagnosis', axis = 1)

# Load the target variable to y
y = df['diagnosis']

### KNN with K=7

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

model = KNeighborsClassifier(n_neighbors=7)

# Train the model using the training sets. 
model.fit(X_train, y_train)

# Getting predictions from the model 
y_test_hat = model.predict(X_test)

cm = confusion_matrix(y_test, y_test_hat)
print(cm)

[[100   8]
 [ 14  49]]


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           B       0.88      0.93      0.90       108
           M       0.86      0.78      0.82        63

    accuracy                           0.87       171
   macro avg       0.87      0.85      0.86       171
weighted avg       0.87      0.87      0.87       171



In [None]:
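# Assigning variables for convenience.
# Rows of cm are actual classes and columns are predicted classes,
# in sorted label order (B, M), so M (malignant) is the positive class.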
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

recall = TP / float(FN + TP)
print("recall:", recall)

precision = TP / float(TP + FP)
print("precision:", precision)

specificity = TN / (TN + FP)
print("specificity:", specificity)

recall: 0.7777777777777778
precision: 0.8596491228070176
specificity: 0.9259259259259259
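
The same three metrics can also be obtained from scikit-learn directly, which avoids indexing the confusion matrix by hand. The sketch below is self-contained: instead of the notebook's `y_test` and `y_test_hat`, it rebuilds two label vectors (hypothetical names `y_true` and `y_pred`) that reproduce the confusion matrix printed above. It assumes that specificity can be read as the recall of the negative class (B).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Label vectors constructed only to reproduce the matrix [[100, 8], [14, 49]];
# in the notebook itself, y_test and y_test_hat would be used instead.
y_true = np.array(["B"] * 108 + ["M"] * 63)
y_pred = np.array(["B"] * 100 + ["M"] * 8 + ["B"] * 14 + ["M"] * 49)

# Sanity check: the rebuilt vectors give the same confusion matrix.
cm_check = confusion_matrix(y_true, y_pred, labels=["B", "M"])

# pos_label selects which class counts as "positive".
recall_m = recall_score(y_true, y_pred, pos_label="M")
precision_m = precision_score(y_true, y_pred, pos_label="M")

# Specificity is the recall of the negative class (B).
specificity_m = recall_score(y_true, y_pred, pos_label="B")

print("recall:", recall_m)
print("precision:", precision_m)
print("specificity:", specificity_m)
```

Passing `pos_label` explicitly makes it clear which class is treated as positive, which matters here because the labels are strings rather than 0/1.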


### Repeat with scaling
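
KNN classifies a point by its distance to its neighbours, so a feature measured in large units can dominate the distance and drown out the others. The sketch below illustrates this with synthetic data, not the real dataset: the two columns and their means and spreads are hypothetical values chosen only to mimic a small-scale and a large-scale feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical synthetic features on very different scales (illustration only).
small_scale = rng.normal(loc=14.0, scale=3.5, size=500)     # e.g. a radius-like feature
large_scale = rng.normal(loc=650.0, scale=350.0, size=500)  # e.g. an area-like feature
X_demo = np.column_stack([small_scale, large_scale])

def large_feature_share(features):
    """Share of the total squared Euclidean distance between consecutive
    rows that is contributed by the second (large-scale) column."""
    squared_diff = (features[1:] - features[:-1]) ** 2
    per_feature = squared_diff.sum(axis=0)
    return per_feature[1] / per_feature.sum()

raw_share = large_feature_share(X_demo)
scaled_share = large_feature_share(StandardScaler().fit_transform(X_demo))

print("share from large-scale feature, raw:   ", raw_share)
print("share from large-scale feature, scaled:", scaled_share)
```

On the raw data almost all of the distance comes from the large-scale column. After standardisation both columns contribute on a comparable footing, which is the motivation for repeating the experiment with scaled features.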

In [None]:
# Scale the dataset first. Only the features need to be scaled, not the outcome.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

sX = X.copy()
sX = scaler.fit_transform(sX)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sX, y, test_size=0.30, random_state=1)

In [None]:
model = KNeighborsClassifier(n_neighbors=7)

# Train the model using the training sets. 
model.fit(X_train, y_train)

# Getting predictions from the model 
y_test_hat = model.predict(X_test)

cm = confusion_matrix(y_test, y_test_hat)
print(cm)

[[102   6]
 [  5  58]]


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_hat))

              precision    recall  f1-score   support

           B       0.95      0.94      0.95       108
           M       0.91      0.92      0.91        63

    accuracy                           0.94       171
   macro avg       0.93      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171



In [None]:
# Assigning variables for convenience (rows = actual, columns = predicted; M is the positive class).
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

recall = TP / float(FN + TP)
print("recall:", recall)

precision = TP / float(TP + FP)
print("precision:", precision)

specificity = TN / (TN + FP)
print("specificity:", specificity)

recall: 0.9206349206349206
precision: 0.90625
specificity: 0.9444444444444444
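
Note that the scaling cell above fits `StandardScaler` on all rows before the train/test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and wrap the scaler and the classifier in a pipeline, so the scaler is fitted on the training rows only. The sketch below is a minimal version of that idea under two assumptions: it uses scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the CSV, and it takes that copy's first ten columns as the mean features. In the bundled copy the target is encoded as 0 = malignant, 1 = benign rather than M/B, and the resulting numbers are not guaranteed to match the outputs above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the CSV used in this notebook (assumption, see lead-in).
data = load_breast_cancer()
X_all = data.data[:, :10]  # first ten columns: the "mean ..." features
y_all = data.target        # 0 = malignant, 1 = benign in this copy

# Split BEFORE scaling so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.30, random_state=1
)

# The pipeline fits the scaler on the training rows only, then
# applies the same transformation to the test rows at predict time.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Keeping the scaler inside the pipeline also means the same object can be passed to cross-validation utilities without leaking test information into the scaling step.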
