In [282]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 

warnings.simplefilter('ignore')

**Handle Data**: Open the dataset from CSV and split into test/train datasets.

In [2]:
cancer_data = pd.read_csv("haberman.csv",names=['age','op_year','no_pos_lymph_nodes','survival_status'])

In [9]:
print(cancer_data.shape)
cancer_data.head()

(306, 4)


   age  op_year  no_pos_lymph_nodes  survival_status
0   30       64                   1                1
1   30       62                   3                1
2   30       65                   0                1
3   31       59                   2                1
4   31       65                   4                1


In [8]:
cancer_data["survival_status"].value_counts()

1    225
2     81
Name: survival_status, dtype: int64

## KNN step by step Implementation
**Next we need to split the dataset randomly into train and test datasets. A 67/33 train/test split is a commonly used ratio.**

In [19]:
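import csv
import random

def loadDataset(filename, split, trainingSet, testSet):
    # Append each CSV row to trainingSet with probability `split`, else to testSet.
    # Both lists are required arguments: mutable defaults such as trainingSet=[]
    # are shared between calls and would silently accumulate rows.
    with open(filename, 'rt') as csvfile:
        dataset = list(csv.reader(csvfile))
    # Iterate over every row; a range(len(dataset)-1) bound would drop the last record.
    for row in dataset:
        if not row:          # skip blank lines, e.g. a trailing empty line
            continue
        for y in range(4):   # convert age, op_year, nodes and label to float
            row[y] = float(row[y])
        if random.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)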

In [50]:
trainingSet=[]
testSet=[]
loadDataset('haberman.csv', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))

Train: 202
Test: 103
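
Because `random.random()` is not seeded, the train/test counts change on every run. The following is a minimal sketch of a reproducible variant. The helper name `split_rows`, the seed value 42, and the toy rows are assumptions made for illustration; the sketch works on in-memory rows instead of reading `haberman.csv`.

```python
import random

def split_rows(rows, split, seed=42):
    """Randomly assign each row to train (probability `split`) or test."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    train, test = [], []
    for row in rows:
        (train if rng.random() < split else test).append(row)
    return train, test

# Toy rows standing in for the CSV contents (values are illustrative only).
rows = [[float(i), 60.0, 0.0, 1.0] for i in range(306)]

train_a, test_a = split_rows(rows, 0.66)
train_b, test_b = split_rows(rows, 0.66)

print(len(train_a), len(test_a))   # the two sizes always add up to 306
print(train_a == train_b)          # same seed, same split
```

With a fixed seed, accuracy figures computed later can be compared between runs.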


In order to make predictions we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training dataset for a given member of the test dataset and in turn make a prediction.

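Given that all three features (age, operation year, and number of positive lymph nodes) are numeric, we can directly use the Euclidean distance measure. This is defined as the square root of the sum of the squared differences between the two arrays of numbers (read that again a few times and let it sink in). Note that the features are not in the same units, so the feature with the widest range tends to contribute most to the distance.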

Additionally, we want to control which fields to include in the distance calculation. Specifically, we only want to include the first 3 attributes, because the fourth field is the class label. One approach is to limit the Euclidean distance to a fixed length, ignoring the final dimension.

Putting all of this together, we can define the euclideanDistance function as follows:

In [22]:
import math
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
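
As a worked example, the distance between rows 0 and 3 of the table shown earlier can be computed by hand and checked against the function. The function is repeated here only so that the snippet runs on its own.

```python
import math

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# Rows 0 and 3 of cancer_data.head(): age, op_year, no_pos_lymph_nodes, survival_status
a = [30, 64, 1, 1]
b = [31, 59, 2, 1]

# Only the first 3 fields are compared; the label in position 3 is ignored.
d = euclideanDistance(a, b, 3)

# By hand: (30-31)^2 + (64-59)^2 + (1-2)^2 = 1 + 25 + 1 = 27
print(d, math.sqrt(27))
```

The operation year difference (5) dominates the result here, which illustrates how a feature with a larger spread can outweigh the others.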

Neighbors

Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.

This is a straightforward process of calculating the distance to every training instance and selecting the subset with the smallest distance values.

Below is the getNeighbors function that returns the k most similar neighbors from the training set for a given test instance (using the already defined euclideanDistance function).

In [29]:
import operator 
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
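
A small usage sketch may help here. It uses `heapq.nsmallest` as an alternative to sorting the full distance list; this is a swapped-in technique, not the notebook's implementation. The names `euclidean` and `k_nearest` and the query row are hypothetical.

```python
import heapq
import math

def euclidean(a, b, length):
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(length)))

def k_nearest(training_set, test_instance, k):
    """Return the k training rows closest to test_instance (label excluded)."""
    length = len(test_instance) - 1   # last field is the class label
    return heapq.nsmallest(
        k, training_set, key=lambda row: euclidean(test_instance, row, length)
    )

# The five rows from cancer_data.head() serve as a tiny training set.
train = [[30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1], [31, 65, 4, 1]]
query = [30, 63, 2, 1]   # made-up test instance

neighbors = k_nearest(train, query, 2)
print(neighbors)
```

For this query the two closest rows are `[30, 64, 1, 1]` and `[30, 62, 3, 1]`, each at distance sqrt(2).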

In [281]:
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
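
getResponse takes a majority vote over the class labels (the last field) of the neighbors. A compact sketch of the same idea using `collections.Counter`; the helper name `majority_label` and the neighbor rows are made up for illustration.

```python
from collections import Counter

def majority_label(neighbors):
    """Return the most common class label (last field) among the neighbors."""
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

# Three illustrative neighbors: two of class 1, one of class 2.
neighbors = [[34, 60, 0, 1], [38, 66, 11, 2], [41, 64, 0, 1]]
print(majority_label(neighbors))
```

With two classes, an odd k avoids tied votes; with an even k a tie is possible and the result then depends on the order in which labels are encountered.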

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made, called the classification accuracy.

Below is the getAccuracy function that sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

In [40]:
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:  # compare label values, not object identity
            correct += 1
    return (correct/float(len(testSet))) * 100.0
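
The labels here are floats parsed from the CSV, so they should be compared by value with `==`. An identity check with `is` asks whether two names refer to the same object, which is generally not true for equal floats created separately. A minimal sketch with made-up labels; `accuracy_percent` is a hypothetical helper name.

```python
def accuracy_percent(actual, predicted):
    """Percentage of positions where the predicted label equals the actual one."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Labels built from strings, as they would be after reading a CSV file.
actual = [float("1"), float("2"), float("1"), float("1")]
predicted = [float("1"), float("1"), float("1"), float("2")]

print(accuracy_percent(actual, predicted))   # 2 of the 4 labels match
```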

 #### kNN is a lazy learner: it does no work until a prediction is requested. This has the benefit of only including data relevant to the unseen instance, called a localized model.
 
 #### A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.
 
 #### Finally, kNN is powerful because it assumes nothing about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.

### Classify Cancer Cases using 
    'Age'
    'Operation year'
    'Number of Positive Lymph Nodes'

#### We can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Because 225 of the 306 cases belong to class 1, a useful classifier should at least beat the accuracy obtained by always predicting class 1.
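
A reference point for judging accuracy on this dataset is the majority-class baseline, computed here from the class counts printed by `value_counts()` earlier. This is plain arithmetic on those counts, not a model.

```python
# Class counts as printed by cancer_data["survival_status"].value_counts()
counts = {1: 225, 2: 81}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline = 100.0 * counts[majority_class] / total
print(majority_class, round(baseline, 2))
```

Always predicting class 1 is correct for 225 of 306 cases, roughly 73.5%, so accuracies near that level indicate little gain over guessing the majority class.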

In [47]:
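# generate predictions
predictions=[]

def predict(trainingSet,testSet,k):
    # Classify every test row by majority vote among its k nearest training rows.
    # Results go into the module-level `predictions` list, which is emptied first
    # so that re-running this cell does not append to stale results.
    predictions.clear()
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))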


In [None]:
predict(trainingSet,testSet,3)

In [None]:
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: ' + repr(accuracy) + '%')

## KNN Implementation with sklearn lib KNeighborsClassifier

In [308]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

def getAccuracyOfPredictions(y_test,survival_prediction):
    return metrics.accuracy_score(y_test, survival_prediction) * 100

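def knnDataSplitFitPredictAccuracy(cancer_feature_data,cancer_target_data,k_list):
    # For each k: draw a fresh random 70/30 split, fit kNN and print the test accuracy.
    # No random_state is set, so each k is scored on a different split and the
    # printed accuracies vary between runs.
    for k in k_list:
        X_train, X_test, y_train, y_test = train_test_split(cancer_feature_data, cancer_target_data, test_size=0.3)
        knnc = KNeighborsClassifier(n_neighbors=k)
        knnc.fit(X_train,y_train)
        survival_prediction = knnc.predict(X_test)
        accuracy = getAccuracyOfPredictions(y_test,survival_prediction)
        print("K = {0} Accuracy %:{1}".format(k,accuracy))
    # Only the predictions for the last k in k_list are returned.
    return survival_prediction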
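
Since `train_test_split` is called inside the loop without a `random_state`, each k is scored on a different random split. One way to compare values of k on identical data is cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data; the data, the seed, and the choice of 5 folds are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 3 features and the 1/2 class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = np.where(X[:, 2] > 0, 2, 1)

mean_accuracy = {}
for k in [2, 3, 5, 8]:
    # cv=5 uses the same 5 stratified folds for every k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_accuracy[k] = 100 * scores.mean()
    print("K = {0} mean accuracy %: {1:.2f}".format(k, mean_accuracy[k]))
```

Because every k sees the same folds, differences in the mean accuracy reflect the choice of k and not the luck of a particular split.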

# After EDA feature selection

Observations

    After analysing various plots, it can be inferred that the greater the number of positive axillary lymph nodes found, the lower the chance of survival.
        Age has little impact: many older patients with few positive lymph nodes survived.
        Operation year also carries little weight on the outcome, as medical advancement was slow during those years.
    From the Kolmogorov-Smirnov statistic on the survived and not-survived data:
        we cannot reject the null hypothesis, as the p-value is high (above 0.90) and the statistic is low (around 0.07).



### All features vs. age and lymph nodes only

In [327]:
k_list = [2,3,5,8]

print("With all features")
# columns 0-2: age, op_year, no_pos_lymph_nodes; column 3: survival_status
cancer_feature_data = cancer_data.iloc[:, [0,1,2]]
cancer_target_data = cancer_data.iloc[:, 3]   # 1-D Series, the shape scikit-learn expects for y
all_feat_prediction = knnDataSplitFitPredictAccuracy(cancer_feature_data,cancer_target_data,k_list)

print("\nConsidering only Number of Positive lymph nodes and Age")
cancer_feature_data = cancer_data.iloc[:, [0,2]]   # age and no_pos_lymph_nodes
cancer_target_data = cancer_data.iloc[:, 3]
two_feat_prediction = knnDataSplitFitPredictAccuracy(cancer_feature_data,cancer_target_data,k_list)


With all features
K = 2 Accuracy %:73.91304347826086
K = 3 Accuracy %:60.86956521739131
K = 5 Accuracy %:76.08695652173914
K = 8 Accuracy %:73.91304347826086

Considering only Number of Positive lymph nodes and Age
K = 2 Accuracy %:76.08695652173914
K = 3 Accuracy %:76.08695652173914
K = 5 Accuracy %:80.43478260869566
K = 8 Accuracy %:72.82608695652173


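# In this run, using only Age and Number of Positive Lymph Nodes gave higher accuracy for K = 2, 3 and 5, but not for K = 8

Each K is evaluated on a different random split, so these differences are indicative and not conclusive.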
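
To judge how much weight the printed accuracies can bear, it helps to convert them back to counts. The sketch assumes the 30% test split of 306 rows is rounded up to 92 rows, which is consistent with every reported accuracy being a multiple of 1/92. The standard error uses the usual binomial approximation, and the value 0.76 is a representative accuracy picked from the output above.

```python
import math

n_test = math.ceil(0.3 * 306)   # assumed test set size: 92 rows

# A few of the accuracies printed above, converted to correct-prediction counts
reported = [73.91304347826086, 60.86956521739131, 76.08695652173914, 80.43478260869566]
correct_counts = [round(p / 100 * n_test) for p in reported]
print(n_test, correct_counts)

# Binomial standard error of an accuracy around 76% measured on n_test rows
p = 0.76
std_err = math.sqrt(p * (1 - p) / n_test)
print(round(100 * std_err, 1), "percentage points")
```

One test row is worth about 1.1 percentage points and the standard error is roughly 4.5 points, so gaps of a few points between configurations are within the noise of a single split.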

In [None]:
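# Scatter of true vs. predicted class labels.
# NOTE: y_test is local to knnDataSplitFitPredictAccuracy, so it is not defined
# here unless that function is changed to return it as well.
plt.scatter(y_test, two_feat_prediction)
plt.xlabel("True Values")
plt.ylabel("Predictions")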
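
Both axes of this scatter plot can only take the values 1 and 2, so all points fall on at most four positions. A confusion matrix expresses the same comparison as counts. The sketch below uses made-up labels, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Illustrative labels only; not taken from the notebook's results.
actual = [1, 1, 2, 1, 2, 2, 1]
predicted = [1, 1, 1, 1, 2, 1, 2]

cm = confusion_counts(actual, predicted)
for a in (1, 2):
    # one row per actual class, one column per predicted class
    print("actual", a, [cm[(a, p)] for p in (1, 2)])
```

The off-diagonal counts show which class is being confused with which, something a single accuracy figure hides on an imbalanced dataset.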

In [330]:
#print("Score:", model.score(X_test, y_test))