# Predicting the Survival Status of Breast Cancer Patients (kNN Classifier from Scratch)

**“The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.”**

There are four variables in the dataset, including the dependent variable. We are going to use Age and axillary_nodes_count to build a lazy kNN classifier that predicts whether a patient survived five years or longer, or died within five years.

The dataset has two classes, 1 and 2 (1 = the patient survived 5 years or longer, 2 = the patient died within 5 years). We are going to drop the operation_year column in this exercise and leave the class label as it is.

We have to first prepare the data for computation.

In [331]:
import pandas as pd
df = pd.read_csv("haberman.csv", delimiter=",")
df.head()

Unnamed: 0,Age,operation_year,axillary_nodes_count,survival_status
0,30,64,1,1
1,30,62,3,1
2,30,65,0,1
3,31,59,2,1
4,31,65,4,1


In [332]:
# now we are going to drop the year column
df.columns

Index(['Age', 'operation_year', 'axillary_nodes_count', 'survival_status'], dtype='object')

In [333]:
df.drop(["operation_year"], axis=1, inplace=True)
df.head(10)

Unnamed: 0,Age,axillary_nodes_count,survival_status
0,30,1,1
1,30,3,1
2,30,0,1
3,31,2,1
4,31,4,1
5,33,10,1
6,33,0,1
7,34,0,2
8,34,9,2
9,34,30,1


In [334]:
df.shape  # We have 306 rows and 3 columns

(306, 3)

**We are going to create two dictionaries: one for the train set and one for the test set. We will iterate over the dataframe and use the random function to send each row to the train set with probability 0.70, so roughly 70 percent of the rows end up in the train set and the rest in the test set.**

kNN is a lazy model, meaning nothing is fitted in advance: all the work happens when a sample point is fed into the prediction function, which compares it with every training point. This makes prediction computationally expensive.

In [335]:
# Before normalizing, we split the data into two objects: X holds the predictor variables
# and y holds the dependent variable.
y = df["survival_status"]
X = df.drop("survival_status", axis=1)
X.head()


Unnamed: 0,Age,axillary_nodes_count
0,30,1
1,30,3
2,30,0
3,31,2
4,31,4


In [336]:
# normalizing X (min-max scaling of each column to the [0, 1] range)
import numpy as np
X = X.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
X.head(10)

Unnamed: 0,Age,axillary_nodes_count
0,0.0,0.019231
1,0.0,0.057692
2,0.0,0.0
3,0.018868,0.038462
4,0.018868,0.076923
5,0.056604,0.192308
6,0.056604,0.0
7,0.075472,0.0
8,0.075472,0.173077
9,0.075472,0.576923
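
To make the normalization step concrete, here is a minimal sketch of the same min-max formula applied to a toy column. The helper name `min_max_scale` and the toy values are assumptions for illustration; the values are chosen so that the range is 53, which is consistent with the scaled value 0.018868 (about 1/53) shown above for an age one year above the minimum.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D sequence to the [0, 1] range (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no spread; map everything to 0 to avoid dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Toy ages (not read from haberman.csv): the minimum maps to 0, the maximum to 1.
toy_ages = [30, 31, 83]
scaled_ages = min_max_scale(toy_ages)
print(scaled_ages)
```

Scaling matters for kNN because the Euclidean distance would otherwise be dominated by whichever column has the larger numeric range.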


In [337]:
import random

X_train = {}
X_test = {}
# Iterate over the rows and send each one to the train set with probability 0.70,
# otherwise to the test set. Each entry is a (feature vector, label) tuple.
for i in range(len(X)):
    point = (np.array([X.loc[i, "Age"], X.loc[i, "axillary_nodes_count"]]), y[i])
    if random.random() < 0.70:
        X_train[i] = point
    else:
        X_test[i] = point

In [338]:
print(len(X_train))
print(len(X_test))

220
86
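
Because each row is assigned by an independent random draw, the split sizes vary from run to run and are only approximately 70/30 (here 220 and 86). If a repeatable split is wanted, one option is to seed a dedicated random generator. The sketch below is an assumption-laden illustration, not part of the original notebook: `split_indices`, `train_fraction`, and `seed` are hypothetical names.

```python
import random

def split_indices(n_rows, train_fraction=0.70, seed=0):
    """Assign each row index to train or test with a seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < train_fraction:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

train_idx, test_idx = split_indices(306)
print(len(train_idx), len(test_idx))
```

Calling `split_indices` again with the same seed returns the same partition, which makes accuracy scores comparable between runs.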


Now we are going to create the compareDistance function, which takes two data tuples of the form (feature vector, label) and calculates the Euclidean distance between their vectors.

In [339]:
def compareDistance(a, b):
    v1 = a[0]
    v2 = b[0]
    distance = np.sqrt(np.sum((v1 - v2)**2))
    return distance

**Remember: the larger the distance, the less similar the data points are.**
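
As a quick check of the distance formula, the sketch below applies it to two made-up points in the same (feature vector, label) layout. The function name `euclidean` and the points `p` and `q` are assumptions for illustration only.

```python
import numpy as np

def euclidean(v1, v2):
    """Euclidean distance between two feature vectors (hypothetical helper)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.sqrt(np.sum((v1 - v2) ** 2)))

# Two toy data tuples: (normalized feature vector, class label)
p = (np.array([0.0, 0.0]), 1)
q = (np.array([0.3, 0.4]), 2)

# sqrt(0.3^2 + 0.4^2) = sqrt(0.25), so d is approximately 0.5
d = euclidean(p[0], q[0])
print(d)
```

Only the vectors enter the computation; the labels are carried along so that the neighbors' classes can be looked up later.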

Now we are going to write a function that takes the id of a test point and a value k, and returns the ids of its k nearest neighbors in the train set.

In [340]:
import operator

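def neighbor(test_id, k):
    # Distance from the test point to every training point
    distance = []
    for i in X_train:
        dist = compareDistance(X_test[test_id], X_train[i])
        distance.append((i, dist))
    # Sort by distance, closest first
    distance.sort(key=operator.itemgetter(1))
    # Keep the ids of the k closest training points
    neighbors = [train_id for train_id, _ in distance[:k]]
    return neighbors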
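
Sorting every distance is simple but does more work than needed when k is small. As one possible alternative, the sketch below uses `heapq.nsmallest` from the standard library to pick the k closest points directly. The names `k_nearest` and `toy_train` are assumptions, and the toy points are made up.

```python
import heapq
import numpy as np

def k_nearest(query_vec, train, k):
    """Return the ids of the k training points closest to query_vec (hypothetical helper).

    train is a dict of id -> (feature vector, label), as in the notebook.
    """
    return heapq.nsmallest(
        k,
        train,  # iterating a dict yields its keys, i.e. the training ids
        key=lambda i: np.linalg.norm(train[i][0] - query_vec),
    )

toy_train = {
    0: (np.array([0.0, 0.0]), 1),
    1: (np.array([1.0, 1.0]), 2),
    2: (np.array([0.1, 0.0]), 1),
    3: (np.array([0.9, 1.0]), 2),
}

nearest_ids = k_nearest(np.array([0.0, 0.05]), toy_train, k=2)
print(nearest_ids)
```

The result is ordered from closest to farthest, the same ordering the sort-based version produces.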

In [341]:
X_test.keys()

dict_keys([5, 9, 15, 16, 17, 20, 23, 25, 31, 32, 36, 39, 42, 43, 44, 45, 49, 51, 52, 56, 63, 73, 77, 83, 85, 92, 95, 96, 102, 103, 104, 109, 115, 119, 123, 127, 128, 129, 132, 138, 140, 143, 144, 148, 150, 152, 158, 166, 167, 170, 180, 186, 190, 192, 205, 206, 209, 210, 217, 220, 232, 236, 237, 238, 239, 240, 242, 243, 247, 248, 250, 254, 255, 256, 263, 265, 266, 267, 270, 271, 285, 287, 288, 289, 293, 305])

Now we are going to create the final function, which runs every test point through the neighbor function and predicts its class by majority vote among the k neighbors. The function returns the accuracy score: the fraction of test points whose class was predicted correctly.

In [342]:
from collections import Counter  

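def accuracy_score(test_set, k):
    correct = 0
    total = 0
    for i in test_set:
        total += 1
        n = neighbor(i, k)
        # Count the class labels of the k nearest training points
        result = Counter([X_train[p][1] for p in n])
        # The predicted class is the most frequent label among the neighbors
        if max(result, key=result.__getitem__) == test_set[i][1]:
            correct += 1
    return correct / total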


    

In [343]:
print("When running the test set our model accuracy score is {}".format(accuracy_score(X_test, 5)))

When running the test set our model accuracy score is 0.7906976744186046
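
The prediction inside the scoring function is a majority vote over the neighbors' labels. The sketch below isolates that step on made-up labels; `majority_label` is a hypothetical helper name. With two classes, an odd k such as 5 cannot produce a tie.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in a list of neighbor labels (hypothetical helper)."""
    # most_common(1) returns a list with one (label, count) pair
    return Counter(labels).most_common(1)[0][0]

# Labels of five hypothetical neighbors: three of class 1, two of class 2
vote = majority_label([1, 1, 2, 1, 2])
print(vote)  # 1
```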


**We can use k-fold cross-validation to get a more reliable estimate of the accuracy than a single random split gives, and to choose the value of k that generalizes best.**
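
A minimal sketch of how k-fold cross-validation could be wired up for this kind of classifier follows. It is not the notebook's code: `knn_predict`, `k_fold_accuracy`, `n_folds`, and `seed` are assumed names, and the data is synthetic (two well-separated clusters) rather than the Haberman rows, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of query by majority vote among its k nearest training rows."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def k_fold_accuracy(X, y, k_neighbors, n_folds=5, seed=0):
    """Average accuracy over n_folds held-out folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))        # shuffle the row indices once
    folds = np.array_split(order, n_folds)  # n_folds nearly equal chunks
    scores = []
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != f])
        correct = sum(
            knn_predict(X[train_idx], y[train_idx], X[i], k_neighbors) == y[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return float(np.mean(scores))

# Synthetic stand-in data: two clusters in the unit square, labelled 1 and 2
demo_rng = np.random.default_rng(42)
X_demo = np.vstack([
    demo_rng.normal(0.2, 0.05, size=(30, 2)),
    demo_rng.normal(0.8, 0.05, size=(30, 2)),
])
y_demo = np.array([1] * 30 + [2] * 30)

# Compare a few candidate values of k by their cross-validated accuracy
cv_scores = {k: k_fold_accuracy(X_demo, y_demo, k) for k in (1, 3, 5, 7)}
print(cv_scores)
```

Every row is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split than the single 70/30 score above.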