We are going to examine the Breast Cancer dataset and use Python’s sklearn library to build a K-nearest neighbors (KNN) classifier.
After fitting the KNN classifier, we will use the trained model to predict whether a patient’s tumor is benign or malignant. A major advantage of sklearn is that it lets us implement machine learning algorithms in a few lines of code.

Each record in this dataset consists of a sample id, 9 numeric attributes scored from 1 to 10, and 1 target class attribute. The class attribute holds the observation result: whether the patient’s tumor is benign or malignant. Benign tumors do not spread to other parts of the body, while malignant tumors are cancerous. The dataset was collected & openly distributed so that patterns could be discovered in it.

Breast Cancer Data Set Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 – 10
3. Uniformity of Cell Size: 1 – 10
4. Uniformity of Cell Shape: 1 – 10
5. Marginal Adhesion: 1 – 10
6. Single Epithelial Cell Size: 1 – 10
7. Bare Nuclei: 1 – 10
8. Bland Chromatin: 1 – 10
9. Normal Nucleoli: 1 – 10
10. Mitoses: 1 – 10
11. Class: (2 for benign, 4 for malignant)

Problem Statement:
To build a KNN classifier on the Breast Cancer data that predicts whether a patient’s tumor is benign or malignant.

Python packages used:

NumPy

NumPy (Numeric Python) is a module that provides fast mathematical functions.

NumPy provides robust data structures for efficient computation on multi-dimensional arrays & matrices.

We use NumPy to read the data file into NumPy arrays and to manipulate the data.



Scikit-Learn

It’s a machine learning library that includes various machine learning algorithms.

We are using its Imputer, train_test_split, KNeighborsClassifier, and accuracy_score utilities.

In [1]:
import numpy as np
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
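# Note: Imputer was removed in scikit-learn 0.22; newer releases provide
# sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import Imputer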

We are using the breast cancer data, which you can download from the archive.ics.uci.edu website. To import and manipulate the data, we are going to use NumPy arrays.
The genfromtxt() function loads data from a text file; we use it to import our dataset into a 2d NumPy array. We are passing 3 parameters:

fname

The name of the file to read, including its extension.

delimiter

The string used to separate values. In our dataset “,”(comma) is the separator.

dtype

The data type of the resulting array.

All the values in our dataset are numeric, but some are missing and are marked with “?”. With the float dtype, genfromtxt() reads each “?” as NaN, which we later replace through data imputation. This is why we are using the float dtype.

In [3]:
cancer_data = np.genfromtxt(
 fname ='breast-cancer-wisconsin.data', delimiter= ',', dtype= float)

Using the above code we have imported our data into a 2d numpy array.
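To see how the “?” markers are handled, here is a minimal sketch that runs genfromtxt() on an in-memory string instead of the real file. The two rows and the name sample_text are made up for illustration and are not taken from the dataset.

```python
import io

import numpy as np

# Two made-up rows in a comma-separated layout; "?" marks a missing value.
sample_text = "1000025,5,1,?,2\n1002945,5,4,10,2\n"

# genfromtxt() also accepts a file-like object, so no file on disk is needed.
sample = np.genfromtxt(io.StringIO(sample_text), delimiter=",", dtype=float)

print(sample.shape)      # (number of rows, number of columns)
print(np.isnan(sample))  # True only at the position of the "?"
```

Because the missing entry becomes NaN rather than raising an error, the whole file can be loaded first and the gaps filled in a later step.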

len(): Returns the number of records in our data.

str(): Returns a string representation of the array, which gives an idea of the basic structure of the data.

shape: Attribute that holds the array dimensions.

In [4]:
print("Dataset Length:: ", len(cancer_data))
print("Dataset:: ", str(cancer_data))
print("Dataset Shape:: ", cancer_data.shape)

Dataset Length::  699
Dataset::  [[1.000025e+06 5.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000e+00
  2.000000e+00]
 [1.002945e+06 5.000000e+00 4.000000e+00 ... 2.000000e+00 1.000000e+00
  2.000000e+00]
 [1.015425e+06 3.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000e+00
  2.000000e+00]
 ...
 [8.888200e+05 5.000000e+00 1.000000e+01 ... 1.000000e+01 2.000000e+00
  4.000000e+00]
 [8.974710e+05 4.000000e+00 8.000000e+00 ... 6.000000e+00 1.000000e+00
  4.000000e+00]
 [8.974710e+05 4.000000e+00 8.000000e+00 ... 4.000000e+00 1.000000e+00
  4.000000e+00]]
Dataset Shape::  (699, 11)


The first column of the cancer dataset holds the patient’s id. The id carries no diagnostic information, so we should remove it to keep it from biasing the predictions. We can use NumPy’s delete() function for this operation.

delete(): It returns a new array with the specified sub-arrays removed. Three parameters should be passed.

arr: The input array.

obj: The index or indices of the sub-arrays to remove.

axis: The axis along which to delete. axis = 1 is used for columns & axis = 0 for rows.
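The effect of the axis parameter is easiest to see on a tiny array. The sketch below uses a made-up 2x3 array named toy (not part of the dataset) and deletes index 0 once along each axis.

```python
import numpy as np

# Made-up array: the first column plays the role of an id column.
toy = np.array([[101, 5, 1],
                [102, 3, 4]])

# axis=1 removes column 0; axis=0 removes row 0.
without_first_col = np.delete(arr=toy, obj=0, axis=1)
without_first_row = np.delete(arr=toy, obj=0, axis=0)

print(without_first_col)
print(without_first_row)
print(toy.shape)  # the original array is left unchanged
```

Since delete() returns a new array, the result has to be assigned back to a variable, as the cell that follows does with cancer_data.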

In [5]:
cancer_data = np.delete(arr = cancer_data, obj= 0, axis = 1)

In [6]:
cancer_data

array([[ 5.,  1.,  1., ...,  1.,  1.,  2.],
       [ 5.,  4.,  4., ...,  2.,  1.,  2.],
       [ 3.,  1.,  1., ...,  1.,  1.,  2.],
       ...,
       [ 5., 10., 10., ..., 10.,  2.,  4.],
       [ 4.,  8.,  6., ...,  6.,  1.,  4.],
       [ 4.,  8.,  8., ...,  4.,  1.,  4.]])

Now, we wish to divide the dataset into a feature set & a label set. The features are the predictor variables that help us predict the labels (the criterion variable). Here, the first 9 columns hold the numeric attributes that help us predict whether a patient has a benign or a malignant tumor, and the last column holds the class label.

In [7]:
X = cancer_data[:,range(0,9)]
Y = cancer_data[:,9]
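As a quick check of what this kind of indexing produces, the sketch below applies it to a made-up 3x10 array named toy. It assumes the same layout as above (9 feature columns followed by 1 label column) and also shows that the slice :9 selects the same columns as range(0,9).

```python
import numpy as np

# Made-up array with 3 rows and 10 columns (9 features + 1 label).
toy = np.arange(30, dtype=float).reshape(3, 10)

features = toy[:, :9]   # all rows, columns 0..8
labels = toy[:, 9]      # all rows, column 9 only

print(features.shape)   # 2d: one row per record, 9 columns
print(labels.shape)     # 1d: one label per record
print(np.array_equal(features, toy[:, range(0, 9)]))
```

Selecting a single column with an integer index yields a 1d array, which is the label shape that sklearn classifiers expect.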

In [8]:
Y

array([2., 2., 2., 2., 2., 4., 2., 2., 2., 2., 2., 2., 4., 2., 4., 4., 2.,
       2., 4., 2., 4., 4., 2., 4., 2., 4., 2., 2., 2., 2., 2., 2., 4., 2.,
       2., 2., 4., 2., 4., 4., 2., 4., 4., 4., 4., 2., 4., 2., 2., 4., 4.,
       4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 2., 4., 4., 2., 4., 2., 4.,
       4., 2., 2., 4., 2., 4., 4., 2., 2., 2., 2., 2., 2., 2., 2., 2., 4.,
       4., 4., 4., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 4., 4., 4., 4.,
       2., 4., 4., 4., 4., 4., 2., 4., 2., 4., 4., 4., 2., 2., 2., 4., 2.,
       2., 2., 2., 4., 4., 4., 2., 4., 2., 4., 2., 2., 2., 4., 2., 2., 2.,
       2., 2., 2., 2., 2., 2., 4., 2., 2., 2., 4., 2., 2., 4., 2., 4., 4.,
       2., 2., 4., 2., 2., 2., 4., 4., 2., 2., 2., 2., 2., 4., 4., 2., 2.,
       2., 2., 2., 4., 4., 4., 2., 4., 2., 4., 2., 2., 2., 4., 4., 2., 4.,
       4., 4., 2., 4., 4., 2., 2., 2., 2., 2., 2., 2., 2., 4., 4., 2., 2.,
       2., 4., 4., 2., 2., 2., 4., 4., 2., 4., 4., 4., 2., 2., 4., 2., 2.,
       4., 4., 4., 4., 2.

In [9]:
X

array([[ 5.,  1.,  1., ...,  3.,  1.,  1.],
       [ 5.,  4.,  4., ...,  3.,  2.,  1.],
       [ 3.,  1.,  1., ...,  3.,  1.,  1.],
       ...,
       [ 5., 10., 10., ...,  8., 10.,  2.],
       [ 4.,  8.,  6., ..., 10.,  6.,  1.],
       [ 4.,  8.,  8., ..., 10.,  4.,  1.]])

Data Imputation:
Imputation is the process of replacing missing values with substituted values. In our dataset, some columns have missing values. We can replace missing values with the mean, median, mode, or any particular value.
Sklearn provides the Imputer class to perform imputation in 1 line of code. We just need to define missing_values, axis, and strategy; here we leave axis at its default of 0, which imputes along columns. We are using the “median” of each column as the substitute for its missing values.

In [10]:
imp = Imputer(missing_values='NaN', strategy='median')
X = imp.fit_transform(X)
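Imputer is no longer available in scikit-learn 0.22 and later. On those versions, a roughly equivalent sketch uses SimpleImputer from sklearn.impute, which always works column by column. The array toy below is made up for illustration; this is one possible replacement, not the code used to produce the results in this post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column.
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, np.nan],
                [10.0, 6.0]])

# Replace each NaN with the median of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = imputer.fit_transform(toy)

print(filled)
```

Note that SimpleImputer expects np.nan as the missing-value marker here, rather than the string 'NaN'.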



Train, Test data split:
To divide the data into training data & test data, we are using the train_test_split() function from sklearn.

train_test_split(): We are using 4 parameters: X, Y, test_size, random_state

X, Y: X is a numpy array holding the features & Y contains the label for each record.


test_size: The proportion of the data to hold out as test data. If we use 0.4, 40% of the data is separated and saved as testing data. In the cell below we use 0.3.

random_state: The seed for the pseudo-random number generator used for random sampling. If you want to replicate our results, then use the same value of random_state.

Now, X_train & y_train are the training datasets. X_test & y_test are the testing datasets.
ravel() flattens an array into 1d, which is the label shape the classifier expects. Since Y was selected as a single column, y_train & y_test are already 1d here, so the call simply leaves them unchanged.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
 X, Y, test_size = 0.3, random_state = 100)

y_train = y_train.ravel()
y_test = y_test.ravel()
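The roles of test_size and random_state can be checked on a small made-up dataset. In the sketch below, toy_X and toy_y are illustrative names with 10 records, so a test_size of 0.3 should hold out 3 of them, and two calls with the same random_state should return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 records with 2 features each, labels alternate 2 and 4.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([2, 4] * 5)

# Split twice with the same seed to check reproducibility.
split_a = train_test_split(toy_X, toy_y, test_size=0.3, random_state=100)
split_b = train_test_split(toy_X, toy_y, test_size=0.3, random_state=100)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```

Changing random_state to another value would shuffle the records differently and usually produce a different split.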

KNeighborsClassifier(): This is the classifier class for KNN. It is the main entry point for implementing the algorithm. Some important parameters are:

n_neighbors: It holds the value of K and must be an integer. If we don’t give a value for n_neighbors, it defaults to 5.

weights: It holds a string value, i.e., the name of the weight function used in prediction. It can be ‘uniform’, ‘distance’, or a user defined function.
‘uniform’ weights all points in each neighborhood equally. This is the default value for weights.
‘distance’ gives closer neighbors a higher weight and farther neighbors a lower weight, i.e., it weights points by the inverse of their distance.
user defined function: we can pass our own function when we want to produce custom weight values. It accepts an array of distances and returns an array of weights with the same shape.
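As an illustration of the user defined option, the sketch below passes a custom function as weights. The function gaussian_weights and the toy data are hypothetical, chosen only to show the expected input and output shape; they are not used anywhere else in this post.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def gaussian_weights(distances):
    """Hypothetical weight function: weights shrink smoothly with distance."""
    # Receives an array of neighbor distances, returns weights of the same shape.
    return np.exp(-(distances ** 2) / 2.0)


# Made-up 1-feature data: low values are class 2, high values are class 4.
toy_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_y = np.array([2, 2, 2, 4, 4, 4])

model = KNeighborsClassifier(n_neighbors=3, weights=gaussian_weights)
model.fit(toy_X, toy_y)

predictions = model.predict(np.array([[1.5], [10.5]]))
print(predictions)
```

Each query point here is surrounded by neighbors of a single class, so the choice of weight function does not change the outcome; it matters when the neighborhood is mixed.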

algorithm: It specifies the algorithm used to compute the nearest neighbors. It can take values like ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’. It is an optional parameter.
a) ‘ball_tree’ and ‘kd_tree’ use the ball tree and k-d tree algorithms, respectively. These are special kinds of data structures for space partitioning.
b) ‘brute’ is used to implement the brute-force search algorithm.
c) ‘auto’ hands control to the library, which attempts to decide the most appropriate algorithm based on the training data passed to fit().
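The algorithm parameter changes how the neighbors are searched for, not which neighbors count as nearest. A rough way to check this is to fit the same made-up data with each option and compare the predictions, as sketched below. The random data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: 200 random points in 3 dimensions with a simple binary label.
rng = np.random.default_rng(0)
toy_X = rng.random((200, 3))
toy_y = (toy_X[:, 0] > 0.5).astype(int)
queries = rng.random((20, 3))

# Fit one model per search algorithm and store its predictions.
results = {}
for name in ("brute", "kd_tree", "ball_tree", "auto"):
    model = KNeighborsClassifier(n_neighbors=3, algorithm=name)
    model.fit(toy_X, toy_y)
    results[name] = model.predict(queries)

# Compare every algorithm's predictions with the brute-force ones.
for name, preds in results.items():
    print(name, np.array_equal(preds, results["brute"]))
```

On small datasets like ours the difference is mostly about speed, so leaving the parameter at ‘auto’ is usually a reasonable choice.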

fit(): The fit method is used to fit the model. It takes two parameters, X and Y, and must be called to train the KNN model on the training data.
X: The features of the training data.
Y: The labels of the training data.

predict(): It predicts class labels for the data provided as its parameter.

In [12]:
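# Try every K from 1 to 25 and report the test accuracy for each value
for K in range(25):
    K_value = K+1
    neigh = KNeighborsClassifier(n_neighbors = K_value, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train)       # train on the training split
    y_pred = neigh.predict(X_test)    # predict labels for the test split
    print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",K_value)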

Accuracy is  95.23809523809523 % for K-Value: 1
Accuracy is  93.33333333333333 % for K-Value: 2
Accuracy is  95.71428571428572 % for K-Value: 3
Accuracy is  95.23809523809523 % for K-Value: 4
Accuracy is  95.71428571428572 % for K-Value: 5
Accuracy is  94.76190476190476 % for K-Value: 6
Accuracy is  94.76190476190476 % for K-Value: 7
Accuracy is  94.28571428571428 % for K-Value: 8
Accuracy is  94.76190476190476 % for K-Value: 9
Accuracy is  94.28571428571428 % for K-Value: 10
Accuracy is  94.28571428571428 % for K-Value: 11
Accuracy is  94.76190476190476 % for K-Value: 12
Accuracy is  94.76190476190476 % for K-Value: 13
Accuracy is  93.80952380952381 % for K-Value: 14
Accuracy is  93.80952380952381 % for K-Value: 15
Accuracy is  93.80952380952381 % for K-Value: 16
Accuracy is  93.80952380952381 % for K-Value: 17
Accuracy is  93.80952380952381 % for K-Value: 18
Accuracy is  93.80952380952381 % for K-Value: 19
Accuracy is  93.80952380952381 % for K-Value: 20
Accuracy is  93.8095238095238
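Instead of reading the best K off the printed list, the accuracies can be collected and compared in code. The sketch below does this on synthetic data from make_classification, because the real file is not bundled here; the data, the names, and the resulting numbers are therefore assumptions and will not match the output above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 9 features (not the breast cancer data).
demo_X, demo_y = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=100)

# Record the test accuracy for every K from 1 to 25.
accuracy_by_k = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, model.predict(X_te))

# max() returns the first key with the highest accuracy, i.e. the smallest such K.
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print("Best K:", best_k, "accuracy:", accuracy_by_k[best_k])
```

The same pattern applies to the loop above by replacing the synthetic arrays with X_train, X_test, y_train, and y_test.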

In [13]:
X_test[0]

array([2., 1., 1., 1., 2., 1., 1., 1., 1.])

In [14]:
# neigh is the last model fitted in the loop above (K = 25)
y_pred=neigh.predict(X_test)

In [15]:
y_pred[0]

2.0