# KNN Classification Algorithm: Prediction for Benign or Malignant Mammographic Masses in Breast Tissue

##### References:
Tutorial on KNN Algorithm In Machine Learning | KNN Algorithm Using Python | K Nearest Neighbor | Simplilearn : https://www.youtube.com/watch?v=4HKqjENq9OU

Explanation on Confusion Matrix: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62



#### Data Information 

##### Data : https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass

Class Distribution: benign: 516; malignant: 445

Attribute Information:

6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)

1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binominal, goal field!)

What is BI-RADS in breast cancer?
BI-RADS (Breast Imaging-Reporting and Data System) is a risk assessment and quality assurance tool developed by the American College of Radiology. It provides a widely accepted lexicon and reporting schema for breast imaging and applies to mammography, ultrasound, and MRI.

### Imports


In [2]:
import pandas as pd
import numpy as np 
import math

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import f1_score 
from sklearn.metrics import accuracy_score 

### Data Preparation

In [3]:
df = pd.read_csv("mammographic_masses.csv")
# Missing values are encoded as '?'; convert them to NaN and drop incomplete rows
df = df.replace('?', np.nan)
df = df.dropna()
df.head()

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
8,5,57,1,5,3,1
10,5,76,1,4,3,1
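
The cleaning step above can be illustrated without the real file. The sketch below uses only the standard library and a small illustrative CSV string (the rows are stand-ins, not a claim about the file's contents) to show the effect of treating `?` as missing and dropping incomplete rows. It swaps pandas for the `csv` module, so it mirrors the idea of `replace` followed by `dropna` rather than the pandas implementation.

```python
import csv
import io

# Illustrative sample in the dataset's column order; '?' marks a missing value.
raw = """BI-RADS,Age,Shape,Margin,Density,Severity
5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Keep only rows in which no field is the '?' placeholder.
complete_rows = [row for row in rows if "?" not in row.values()]

# Convert the surviving string fields to integers.
cleaned = [{name: int(value) for name, value in row.items()} for row in complete_rows]

print(len(rows), "rows read,", len(cleaned), "rows kept")
```

Dropping rows is the simplest option; it also explains why the index in the output above skips values such as 1 and 4.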


### Split Data into Train and Test

In [4]:
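# Features: the first five columns (BI-RADS assessment through Density).
# Note that the dataset description marks BI-RADS as non-predictive.
X = df.iloc[:, 0:5]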
X.head()

Unnamed: 0,BI-RADS assessment,Age,Shape,Margin,Density
0,5,67,3,5,3
2,5,58,4,5,3
3,4,28,1,1,3
8,5,57,1,5,3
10,5,76,1,4,3


In [5]:
y = df.iloc[:,5]
y.head()

0     1
2     1
3     0
8     1
10    1
Name: Severity, dtype: int64

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

In [7]:
len(X_train), len(X_test), len(y_train), len(y_test)

(664, 166, 664, 166)
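
To make the 664/166 split concrete, here is a minimal standard-library sketch of a shuffled hold-out split. The helper name `simple_train_test_split` is hypothetical, and the shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state = 0`; only the sizes and the idea carry over.

```python
import math
import random

def simple_train_test_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))  # test share is rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

samples = list(range(830))  # stand-in for the 830 cleaned rows
train, test = simple_train_test_split(samples)
print(len(train), len(test))  # 664 166
```

Fixing the seed makes the split repeatable, which is what `random_state` is for in the cell above.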

### Feature Scaling 


Rule of Thumb:
Any algorithm that computes distance or assumes normality, SCALE your FEATURES!
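
A small sketch of why this matters for a distance-based method such as KNN. The patients below are hypothetical (age in years, shape code 1 to 4). On raw values, age differences swamp shape differences; after standardizing each column, the nearest neighbor of the query patient changes.

```python
import math
import statistics

# Hypothetical patients as (age in years, shape code 1-4); values are illustrative only.
patients = [(67, 3), (28, 1), (58, 4), (57, 1), (76, 1)]

def euclidean(a, b):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale every column to zero mean and unit population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

scaled = standardize(patients)

query = 3        # (57, 1)
candidate_a = 2  # (58, 4): one year older, very different shape
candidate_b = 0  # (67, 3): ten years older, somewhat different shape

raw_a = euclidean(patients[query], patients[candidate_a])
raw_b = euclidean(patients[query], patients[candidate_b])
scaled_a = euclidean(scaled[query], scaled[candidate_a])
scaled_b = euclidean(scaled[query], scaled[candidate_b])

print("raw:    A closer than B?", raw_a < raw_b)
print("scaled: A closer than B?", scaled_a < scaled_b)
```

On the raw values candidate A wins purely because ages span a much wider numeric range than shape codes; once both columns are on the same scale, the shape difference counts and candidate B is nearer.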

In [8]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
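
Note that the scaler is fitted on the training set and only applied to the test set. The sketch below shows the same pattern by hand on hypothetical age values: the mean and standard deviation come from the training data alone, so no information from the held-out data leaks into preprocessing.

```python
import statistics

train_ages = [67, 58, 28, 57]  # hypothetical training values
test_ages = [76, 43]           # hypothetical held-out values

# "fit": learn the statistics from the training data only
mean = statistics.mean(train_ages)
std = statistics.pstdev(train_ages)

# "transform": apply those same statistics to both sets
train_scaled = [(age - mean) / std for age in train_ages]
test_scaled = [(age - mean) / std for age in test_ages]

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

The scaled training values have zero mean and unit variance by construction; the scaled test values generally do not, and that is expected.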

In [9]:
# Heuristic for an initial number of neighbors: the square root of the
# sample count, forced to be odd so that a two-class vote cannot tie
def get_n_neighbors(y):
    n = round(math.sqrt(len(y)))
    if n%2 == 0:
        n_neighbors = n - 1
    else:
        n_neighbors = n
        
    return n_neighbors
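
A quick check of what this heuristic returns for the row counts in this notebook. The helper `odd_sqrt_k` is a hypothetical variant that takes the count directly instead of a label vector.

```python
import math

def odd_sqrt_k(n_samples):
    """Round sqrt(n_samples) to the nearest integer, stepping down by one if even."""
    k = round(math.sqrt(n_samples))
    return k - 1 if k % 2 == 0 else k

print(odd_sqrt_k(830))  # all cleaned rows -> 29
print(odd_sqrt_k(664))  # training rows only -> 25
```

The result depends on which vector is passed in: the full label vector gives 29, while the training labels alone would give 25. Either way it is only a starting point; the value of k can be tuned afterwards, for example with cross-validation.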

In [10]:
# Define the model: initialize K-NN
# k is derived from the full label vector y (830 rows), which gives 29
n = get_n_neighbors(y)
classifier = KNeighborsClassifier(n_neighbors = n, p = 2, metric = 'euclidean')

In [11]:
# Fit Model 
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=29)

In [12]:
y_pred = classifier.predict(X_test)
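
To make the prediction step concrete, here is a minimal from-scratch sketch of K-NN with Euclidean distance and a majority vote. It is an illustration of the idea, not scikit-learn's implementation (which uses optimized neighbor searches); the function name and the tiny 2-D dataset are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already-scaled 2-D points: 0 = benign, 1 = malignant.
train_points = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.5),
                (1.0, 0.9), (0.8, 1.2), (1.1, 0.7)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.9, 1.0), k=3))    # 1
print(knn_predict(train_points, train_labels, (-1.0, -1.0), k=3))  # 0
```

There is no training in the usual sense: `fit` only stores the data, and all the work happens when a query point is compared against it.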

### Evaluate the Model


In [13]:
cm = confusion_matrix(y_test,y_pred)


    TP: True positive, Ex: Doctor tells a woman who IS pregnant that she IS pregnant.
    FP: False positive, Ex: Doctor tells a woman who IS NOT pregnant that she IS pregnant (TYPE I Error)
    FN: False negative, Ex: Doctor tells a woman who IS pregnant that she IS NOT pregnant (TYPE II Error)
    TN: True negative, Ex: Doctor tells a woman who IS NOT pregnant that she IS NOT pregnant.

Confusion Matrix

                                 Actual Values 

                              _Positive_Negative_

    Predicted       Positive |  TP    |  FP    |
    Values
                    Negative |  FN    |  TN    |
                             ___________________
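
The four cells can be counted directly from two label lists. The sketch below does this in plain Python with hypothetical labels, treating 1 (malignant) as the positive class; the function name `confusion_counts` is an assumption made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive, actually positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn

y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true_demo, y_pred_demo))  # (3, 1, 1, 3)
```

Keep in mind that `sklearn.metrics.confusion_matrix` arranges these counts with rows for the actual class and columns for the predicted class, labels in sorted order, so for 0/1 labels the array reads `[[TN, FP], [FN, TP]]`.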

In [14]:
# scikit-learn lays the matrix out as rows = actual class, columns = predicted class,
# with labels in sorted order (0 = benign, 1 = malignant): [[TN, FP], [FN, TP]]
tn = cm[0,0]
fp = cm[0,1]
fn = cm[1,0]
tp = cm[1,1]

In [15]:
print("The Confusion Matrix Results")
print(" ")
print(cm)
print(" ")
print("The True Positive: " + str(tp))
print("The False Negative: "+ str(fn))
print("The False Positive: " + str(fp))
print("The True Negative: " + str(tn))

The Confusion Matrix Results
 
[[69 22]
 [15 60]]
 
The True Positive: 60
The False Negative: 15
The False Positive: 22
The True Negative: 69


#### Recall
Of all the actual positive cases (malignant masses), the fraction the model predicted as positive.
Recall should be as high as possible here, because a false negative is a missed malignancy.

In [16]:
recall = tp / (tp + fn)
recall

0.8

#### Precision
Of all the cases predicted as positive, the fraction that are actually positive.

In [17]:
precision = tp / (tp + fp)
round(precision, 4)

0.7317

#### F-measure

A model with low precision and high recall is hard to compare with one that has the opposite profile. The F-score makes them comparable by combining recall and precision into a single number. It uses the harmonic mean rather than the arithmetic mean, which penalizes extreme values more heavily.

In [18]:
f_measure = (2*recall*precision) / (recall + precision)
round(f_measure, 4)

0.7643

In [19]:
print("The f1 score " + str(f1_score(y_test, y_pred)))

The f1 score 0.7643312101910827
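
Two small checks on the F-score, as a sketch. The first uses hypothetical precision and recall values to show how the harmonic mean punishes a lopsided model. The second recomputes F1 from the counts in the confusion matrix printed earlier, read in scikit-learn's layout `[[TN, FP], [FN, TP]]`.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# A lopsided model: the arithmetic mean looks acceptable, the harmonic mean does not.
p, r = 0.9, 0.1
arithmetic = (p + r) / 2
harmonic = f_score(p, r)
print(round(arithmetic, 2), round(harmonic, 2))  # 0.5 0.18

# Counts from the matrix [[69, 22], [15, 60]], read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 69, 22, 15, 60
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
print(round(f1_from_counts, 4))  # 0.7643
```

The value computed from the counts matches the `f1_score` output, which is a useful sanity check on how the matrix cells were assigned.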


#### Accuracy 
The fraction of all predictions that are correct.

In [20]:
print("The accuracy score " + str(accuracy_score(y_test, y_pred)))

The accuracy score 0.7771084337349398
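
The accuracy can be reproduced from the confusion matrix counts, again read as `[[TN, FP], [FN, TP]]`. The sketch also computes a simple reference point, the accuracy of always predicting the majority class in the test set; this baseline is an addition for context, not part of the original analysis.

```python
tn, fp, fn, tp = 69, 22, 15, 60

correct = tp + tn
total = tp + tn + fp + fn
accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 129 166 0.7771

# Baseline: always predict the more common actual class in the test set.
actual_benign = tn + fp
actual_malignant = fn + tp
baseline = max(actual_benign, actual_malignant) / total
print(round(baseline, 4))  # 0.5482
```

With 91 benign and 75 malignant cases in the test set, the classes are fairly balanced, so accuracy is a reasonable summary here, though recall remains the more important figure for missed malignancies.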
