# Classification using Haberman's Survival Data Set

This is a reimplementation of the K-Nearest Neighbors algorithm using plain Python.

In my opinion, it is important to understand how the algorithm works at a low level, not just the abstraction a library provides.

Data Set: [Haberman's Survival Data Set](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)

In [1]:
import math

In [2]:
data = []

##### Import the data and append it to the list

Each row has the form [age_of_the_patient, year_of_operation, number_of_nodes_detected, survival_status].

Check the data set's link above for more details.

In [3]:
with open('dataset.data', 'r') as f:
    for line in f:
        # Each line is a comma-separated row of four integers
        attributes = line.strip().split(',')
        data.append([int(x) for x in attributes])

##### Auxiliary function to summarize the data set
Prints (when `verbose` is true) and returns the total number of samples and the count of each label.

In [4]:
def info_dataset(data, verbose=True):
    label1, label2 = 0, 0
    data_size = len(data)
    for datum in data:
        if datum[-1] == 1:
            label1 += 1
        else:
            label2 += 1
    if verbose:
        print('Total of samples: %d' % data_size)
        print('Total label 1: %d' % label1)
        print('Total label 2: %d' % label2)
    return [data_size, label1, label2]

In [5]:
info_dataset(data)

Total of samples: 306
Total label 1: 225
Total label 2: 81


[306, 225, 81]

##### Define the fraction of the data used for training

In [6]:
p = 0.6
_, label1, label2 = info_dataset(data, False)

##### Split the data set into train set and test set

In [7]:
max_label1, max_label2 = int(p * label1), int(p * label2)
# The first max_label1 + max_label2 rows (in file order) form the train set;
# the remaining rows form the test set.
train_size = max_label1 + max_label2
train_set, test_set = data[:train_size], data[train_size:]
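
The split above assigns the first `max_label1 + max_label2` rows, in file order, to the train set, so the class ratio in each subset depends on how the file is ordered. If a split that keeps the class ratio is wanted, one option is to split each label separately. The sketch below is an alternative under that assumption, not the method used in this notebook; `stratified_split` and the toy rows are hypothetical names and made-up values.

```python
def stratified_split(data, p):
    """Send a fraction p of each label's rows to the train set, the rest to the test set."""
    by_label = {}
    for row in data:
        # Group rows by their label, which is the last element of each row
        by_label.setdefault(row[-1], []).append(row)

    train_set, test_set = [], []
    for rows in by_label.values():
        cut = int(p * len(rows))
        train_set.extend(rows[:cut])
        test_set.extend(rows[cut:])
    return train_set, test_set


# Made-up toy rows in the same layout as the data set: [age, year, nodes, label]
toy = [[30, 64, 1, 1], [31, 59, 2, 1], [34, 66, 9, 2], [35, 64, 13, 1],
       [38, 69, 21, 2], [39, 63, 4, 1], [41, 60, 23, 2], [42, 58, 0, 1],
       [43, 58, 52, 2], [44, 61, 0, 1]]
train_toy, test_toy = stratified_split(toy, 0.5)
print(len(train_toy), len(test_toy))
```

With six rows of label 1 and four of label 2 in the toy data, half of each label ends up in each subset. Shuffling each group before cutting would be a further refinement if the file order is not random.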

##### Define function to calculate the Euclidean distance between two points
[Euclidean Distance - Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)

In [8]:
def euclidian_dist(p1, p2):
    dim, sum_ = len(p1), 0
    # The last element of each sample is the label, so it is left out of the distance
    for index in range(dim - 1):
        sum_ += math.pow(p1[index] - p2[index], 2)
    return math.sqrt(sum_)
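
As a quick sanity check, the distance can be worked out by hand for two made-up samples that differ by 3 in age and by 4 in year of operation: the square root of 3² + 4² is 5. The sketch repeats the function so that it runs on its own; the sample values are invented for illustration and are not rows from the data set.

```python
import math

def euclidian_dist(p1, p2):
    # Same logic as the cell above: the trailing label is skipped
    dim, sum_ = len(p1), 0
    for index in range(dim - 1):
        sum_ += math.pow(p1[index] - p2[index], 2)
    return math.sqrt(sum_)

a = [30, 64, 1, 1]  # invented sample with label 1
b = [33, 60, 1, 2]  # invented sample with label 2
print(euclidian_dist(a, b))  # 5.0
```

The two labels differ, yet the result is exactly 5.0, which confirms that the label does not contribute to the distance.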

##### Calculates the distance between a given sample and every sample in the train set
The distances are stored in a dictionary, which is then sorted to find the K nearest neighbors.
The function then counts which label occurs most often among those neighbors and returns it. On a tie, it returns label 2.

In [9]:
def knn(train_set, new_sample, K):
    dists, train_size = {}, len(train_set)
    
    for i in range(train_size):
        d = euclidian_dist(train_set[i], new_sample)
        dists[i] = d
    
    k_neighbors = sorted(dists, key=dists.get)[:K]
    
    qty_label1, qty_label2 = 0, 0
    for index in k_neighbors:
        if train_set[index][-1] == 1:
            qty_label1 += 1
        else:
            qty_label2 += 1
            
    if qty_label1 > qty_label2:
        return 1
    else:
        return 2
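
The same steps (measure every distance, keep the K smallest, take a majority vote) can also be sketched more compactly with the standard library. This is an illustrative variant, not a replacement for the function above: `knn_compact` and the toy rows are hypothetical, and it assumes the labels are only 1 and 2, as in this data set.

```python
import heapq
import math
from collections import Counter

def euclidian_dist(p1, p2):
    # Distance over the features only; the trailing label is excluded
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1[:-1], p2[:-1])))

def knn_compact(train_set, new_sample, K):
    # Keep the K training rows closest to new_sample
    neighbors = heapq.nsmallest(
        K, train_set, key=lambda row: euclidian_dist(row, new_sample))
    votes = Counter(row[-1] for row in neighbors)
    # Same tie rule as knn: label 1 needs a strict majority
    return 1 if votes[1] > votes[2] else 2

# Made-up training rows: [age, year, nodes, label]
toy_train = [[30, 60, 0, 1], [31, 61, 1, 1], [32, 62, 0, 1],
             [60, 65, 20, 2], [62, 66, 25, 2]]
print(knn_compact(toy_train, [33, 61, 1, 1], 3))   # 1
print(knn_compact(toy_train, [61, 66, 22, 2], 3))  # 2
```

`heapq.nsmallest` avoids building and sorting a dictionary of all distances, which matters little for 306 rows but keeps the intent of "take the K closest" explicit.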

##### Example

In [10]:
print(test_set[0])
print(knn(train_set, test_set[0], 12))

[55, 58, 0, 1]
1


##### Counts the correct predictions of the test set with a given K

In [11]:
correct, K = 0, 15
for sample in test_set:
    label = knn(train_set, sample, K)
    if sample[-1] == label:
        correct += 1

In [12]:
print("Train set size: %d" % len(train_set))
print("Test set size: %d" % len(test_set))
print("Correct predictions: %d" % correct)
# Accuracy is measured on the test set, so divide by its size
print("Accuracy: %.2f%%" % (100 * correct / len(test_set)))

Train set size: 183
Test set size: 123
Correct predictions: 93
Accuracy: 75.61%
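
Accuracy is the share of test samples predicted correctly, so its denominator is the test set size. Because the result depends on K, it can help to compare a few values. The sketch below shows one way to do that on made-up toy data; `accuracy`, the compact `knn`, and all rows are assumptions for illustration, not results from the Haberman data set.

```python
import math

def euclidian_dist(p1, p2):
    # Distance over the features only; the trailing label is excluded
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1[:-1], p2[:-1])))

def knn(train_set, new_sample, K):
    # Compact stand-in for the knn function defined earlier
    nearest = sorted(train_set, key=lambda row: euclidian_dist(row, new_sample))[:K]
    qty_label1 = sum(1 for row in nearest if row[-1] == 1)
    qty_label2 = len(nearest) - qty_label1
    return 1 if qty_label1 > qty_label2 else 2

def accuracy(train_set, test_set, K):
    # Fraction of test samples whose predicted label matches the true label
    correct = sum(1 for sample in test_set if knn(train_set, sample, K) == sample[-1])
    return correct / len(test_set)

# Made-up rows: [age, year, nodes, label]
toy_train = [[30, 60, 0, 1], [31, 61, 1, 1], [32, 62, 0, 1],
             [60, 65, 20, 2], [62, 66, 25, 2]]
toy_test = [[33, 61, 1, 1], [61, 66, 22, 2]]

for K in (1, 3, 5):
    print("K=%d accuracy: %.2f%%" % (K, 100 * accuracy(toy_train, toy_test, K)))
```

On this toy data, K = 5 uses every training row, so the majority label wins for every sample and accuracy drops. This illustrates why K should be chosen with the train set size in mind.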
