# Session 1
## Experiment 2
### Lab

### Data Source
In this experiment, we will use the Wisconsin Breast Cancer data to classify cell samples as malignant or benign.

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

The data has been modified:
  * The id field has been removed
  * The diagnosis field has been moved to the end

### Data Attributes

Number of instances: 569 

Number of attributes: 31 (diagnosis, 30 real-valued input features)

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 1 is Mean Radius, field 11 is Radius SE, and field 21 is Worst Radius. All feature values are recorded with four significant digits.

The last field is diagnosis: M for Malignant and B for Benign

Class distribution: 357 benign, 212 malignant
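
Because the file is read with `header=None`, the columns are only numbered. The layout described above (ten measurements, each repeated as mean, standard error and worst, followed by the diagnosis) is enough to generate readable names. The following is a minimal sketch; the name format and the helper `wdbc_column_names` are illustrative choices, not part of the dataset.

```python
# The ten per-nucleus measurements, in the order listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# One block of ten fields per statistic: fields 1-10, 11-20, 21-30.
STATISTICS = ["mean", "se", "worst"]


def wdbc_column_names():
    """Return 31 names: 30 features followed by the diagnosis label."""
    names = [f"{stat} {feature}"
             for stat in STATISTICS
             for feature in BASE_FEATURES]
    names.append("diagnosis")
    return names


columns = wdbc_column_names()
print(len(columns))
# Fields 1, 11 and 21 (1-based) are list positions 0, 10 and 20.
print(columns[0], "|", columns[10], "|", columns[20], "|", columns[-1])
```

If named columns are wanted in the DataFrame, the list can be passed as `names=wdbc_column_names()` to `pd.read_csv` alongside `header=None`.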

In [2]:
import pandas as pd
pd.read_csv("../Datasets/wdbc.data", header=None)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,17.990,10.38,122.80,1001.0,0.11840,0.27760,0.300100,0.147100,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.71190,0.26540,0.4601,0.11890,M
1,20.570,17.77,132.90,1326.0,0.08474,0.07864,0.086900,0.070170,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.24160,0.18600,0.2750,0.08902,M
2,19.690,21.25,130.00,1203.0,0.10960,0.15990,0.197400,0.127900,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.45040,0.24300,0.3613,0.08758,M
3,11.420,20.38,77.58,386.1,0.14250,0.28390,0.241400,0.105200,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.68690,0.25750,0.6638,0.17300,M
4,20.290,14.34,135.10,1297.0,0.10030,0.13280,0.198000,0.104300,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.40000,0.16250,0.2364,0.07678,M
5,12.450,15.70,82.57,477.1,0.12780,0.17000,0.157800,0.080890,0.2087,0.07613,...,23.75,103.40,741.6,0.17910,0.52490,0.53550,0.17410,0.3985,0.12440,M
6,18.250,19.98,119.60,1040.0,0.09463,0.10900,0.112700,0.074000,0.1794,0.05742,...,27.66,153.20,1606.0,0.14420,0.25760,0.37840,0.19320,0.3063,0.08368,M
7,13.710,20.83,90.20,577.9,0.11890,0.16450,0.093660,0.059850,0.2196,0.07451,...,28.14,110.60,897.0,0.16540,0.36820,0.26780,0.15560,0.3196,0.11510,M
8,13.000,21.82,87.50,519.8,0.12730,0.19320,0.185900,0.093530,0.2350,0.07389,...,30.73,106.20,739.3,0.17030,0.54010,0.53900,0.20600,0.4378,0.10720,M
9,12.460,24.04,83.97,475.9,0.11860,0.23960,0.227300,0.085430,0.2030,0.08243,...,40.68,97.65,711.4,0.18530,1.05800,1.10500,0.22100,0.4366,0.20750,M


Obviously we cannot do much plotting with this many dimensions :-) Let us recall the code we have already used.

In [3]:
import math
import collections
def dist(a, b):
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)
# ------------------------------------------------ #
# We are assuming that the label is the last field #
# If not, munge the data to make it so!            #
# ------------------------------------------------ #
def kNN(k, train, given):
    distances = []
    for t in train:
        distances.append((dist(t[:-1], given), t[-1]))
    distances.sort()
    return distances[:k]

def kNN_classify(k, train, given):
    tally = collections.Counter()
    for nn in kNN(k, train, given):
        tally[nn[-1]] += 1   # count the whole label; update() on a string would count its characters
    return tally.most_common(1)[0]
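
To see what these functions return, here is a toy run on a handful of made-up 2-D points. The block restates the helpers in compact form so that it runs on its own; the points, labels and the lower-case names `knn` and `knn_classify` are invented for the illustration.

```python
import collections
import math


def dist(a, b):
    # Euclidean distance; zip stops at the shorter sequence,
    # so a trailing label on `b` is ignored.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(k, train, given):
    # Each training row is [feature, ..., label].
    distances = sorted((dist(row[:-1], given), row[-1]) for row in train)
    return distances[:k]


def knn_classify(k, train, given):
    tally = collections.Counter(label for _, label in knn(k, train, given))
    return tally.most_common(1)[0]   # (label, number of votes)


toy_train = [
    [1.0, 1.0, "B"],
    [1.5, 2.0, "B"],
    [2.0, 1.0, "B"],
    [6.0, 6.0, "M"],
    [7.0, 7.0, "M"],
]
query = [2.0, 2.0]

neighbours = knn(3, toy_train, query)
label, votes = knn_classify(3, toy_train, query)
print(neighbours)      # the three closest rows as (distance, label) pairs
print(label, votes)    # B 3
```

The three nearest points to `(2, 2)` all carry the label `B`, so the vote is unanimous. Note that the return value is a `(label, count)` pair, which is why the cells below index the result with `[0]`.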

In [5]:
wdbc = pd.read_csv("../Datasets/wdbc.data", header=None)
wdbc.values

array([[17.99, 10.38, 122.8, ..., 0.4601, 0.1189, 'M'],
       [20.57, 17.77, 132.9, ..., 0.275, 0.08902, 'M'],
       [19.69, 21.25, 130.0, ..., 0.3613, 0.08757999999999999, 'M'],
       ..., 
       [16.6, 28.08, 108.3, ..., 0.2218, 0.0782, 'M'],
       [20.6, 29.33, 140.1, ..., 0.4087, 0.124, 'M'],
       [7.76, 24.54, 47.92, ..., 0.2871, 0.07039, 'B']], dtype=object)

But where is the test data? In practice, as the designer of the algorithm, you are given only one set of data. The real test data stays with your teacher/examiner/customer. The standard approach is to hold out a small validation set from your training data and use it both for evaluating performance and for parameter tuning. Let us *randomly* split the data in an 80:20 ratio. We will use 80% for training and the remaining 20% for evaluating performance.

In [6]:
import random
picker = list(range(wdbc.shape[0]))
random.shuffle(picker)       ### randomly shuffle the data
trainMax = int(len(picker) * 0.8)
train = []
test = []
for pick in picker[:trainMax]:
    train.append(list(wdbc.values[pick]))         ### select 80% of data to be used as training set
for pick in picker[trainMax:]:
    test.append(list(wdbc.values[pick]))       ### select 20% of data to be used as test set
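
Since the shuffle is random, every run of the cell above produces a different split and therefore slightly different accuracies. When a repeatable split is useful (for debugging, say), one option is to drive the shuffle from a seeded generator. This is a sketch under that assumption; `split_rows` and its parameters are hypothetical names, and the stand-in rows only mimic the size of the dataset.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=None):
    """Shuffle a copy of `rows` and cut it into (train, test)."""
    rng = random.Random(seed)   # private generator; a fixed seed repeats the split
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in rows with the same count as the dataset (569).
rows = [[i, "label"] for i in range(569)]

train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)
print(len(train_a), len(test_a))   # 455 114
print(train_a == train_b)          # True: same seed, same split
```

With the real data, the rows could be supplied as `wdbc.values.tolist()`.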


In [7]:
print(kNN_classify(5, train, test[0])[0], test[0][-1])
print(kNN_classify(5, train, test[4])[0], test[4][-1])


M M
M M


**Exercise 1** :: Using all 30 features, classify each patient in the test data as malignant or benign. Use K = 5 and Euclidean distance. Find the accuracy (as a percentage) on the test data.

In [8]:
def accuracy(train,test,k):
    results = []
    for t in test:
        results.append(kNN_classify(k,train,t)[0] == t[-1])
    print("Accuracy for k = ",k," is ", results.count(True)/len(test)*100)    
    
accuracy(train,test,5)   

Accuracy for k =  5  is  95.6140350877193
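
It helps to translate the percentage back into counts. With 569 rows and an 80:20 split, `int(569 * 0.8)` gives 455 training rows and leaves 114 test rows, so the figure above corresponds to 109 correct predictions out of 114. A quick check of that arithmetic:

```python
n_rows = 569
n_train = int(n_rows * 0.8)      # truncates 455.2 down to 455
n_test = n_rows - n_train        # rows left for evaluation

reported = 95.6140350877193      # accuracy printed above
n_correct = round(reported / 100 * n_test)

print(n_train, n_test, n_correct)   # 455 114 109
print(n_correct / n_test * 100)     # reproduces the reported percentage
print(100 / n_test)                 # percentage points per test row
```

Each test row is worth roughly 0.88 percentage points, so two runs that differ by one or two points differ by only one or two patients.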


**Exercise 2** :: (*Cross Validation*) Repeat the above (create a random partition and evaluate the performance) 5 times. You will see that the accuracy varies from trial to trial. The average over these trials is in fact a better estimate than any single one.

In [9]:
import random
def randomEvaluations(n,k):
    for i in range(n):
        picker = list(range(wdbc.shape[0]))
        random.shuffle(picker)       ### randomly shuffle the data
        trainMax = int(len(picker) * 0.8)
        train = []
        test = []
        for pick in picker[:trainMax]:
            train.append(list(wdbc.values[pick]))         ### select 80% of data to be used as training set
        for pick in picker[trainMax:]:
            test.append(list(wdbc.values[pick]))       ### select 20% of data to be used as test set
        accuracy(train,test,k)
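
The function above prints one accuracy per trial but does not combine them. A possible way to obtain the average the exercise asks for is to collect the per-trial scores and summarise them. The sketch below is self-contained and runs on invented two-cluster data rather than the real file; `knn_label`, `holdout_accuracy` and `averaged_accuracy` are hypothetical helpers, not the notebook's functions.

```python
import collections
import math
import random
import statistics


def knn_label(k, train, given):
    # Majority label among the k training rows nearest to `given`.
    nearest = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(row[:-1], given[:-1]))), row[-1])
        for row in train
    )[:k]
    return collections.Counter(lbl for _, lbl in nearest).most_common(1)[0][0]


def holdout_accuracy(rows, k, rng, train_fraction=0.8):
    # One random split, scored as a percentage of correct test predictions.
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    hits = sum(knn_label(k, train, t) == t[-1] for t in test)
    return hits / len(test) * 100


def averaged_accuracy(rows, k, n_trials=5, seed=0):
    rng = random.Random(seed)
    scores = [holdout_accuracy(rows, k, rng) for _ in range(n_trials)]
    return statistics.mean(scores), statistics.stdev(scores), scores


# Invented data: two Gaussian blobs standing in for benign/malignant rows.
data_rng = random.Random(1)
toy_rows = (
    [[data_rng.gauss(0, 1), data_rng.gauss(0, 1), "B"] for _ in range(60)]
    + [[data_rng.gauss(3, 1), data_rng.gauss(3, 1), "M"] for _ in range(40)]
)

mean_acc, spread, scores = averaged_accuracy(toy_rows, k=5)
print(scores)
print("mean = %.2f, stdev = %.2f" % (mean_acc, spread))
```

Reporting the standard deviation next to the mean gives a rough sense of how much a single split can mislead.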

**Exercise 3** :: Vary K from 3 to 11 and find the best K. In practice, we will use similar ideas for finding the best hyper-parameters. We will see more of this at a later stage.

In [10]:
for k in range(3,12):
    randomEvaluations(1,k)

Accuracy for k =  3  is  92.98245614035088
Accuracy for k =  4  is  92.98245614035088
Accuracy for k =  5  is  92.10526315789474
Accuracy for k =  6  is  93.85964912280701
Accuracy for k =  7  is  93.85964912280701
Accuracy for k =  8  is  97.36842105263158
Accuracy for k =  9  is  93.85964912280701
Accuracy for k =  10  is  92.98245614035088
Accuracy for k =  11  is  90.35087719298247
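
In the run above, each K is scored on a single random split, and a different one for every K, so part of the difference between rows is due to the split rather than to K. One way to make the comparison steadier, sketched here under the assumption that a few extra evaluations are affordable, is to build several splits once, score every K on those same splits, and compare the means. The data are invented and all helper names are hypothetical.

```python
import collections
import math
import random
import statistics


def knn_label(k, train, given):
    # Majority label among the k training rows nearest to `given`.
    nearest = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(row[:-1], given[:-1]))), row[-1])
        for row in train
    )[:k]
    return collections.Counter(lbl for _, lbl in nearest).most_common(1)[0][0]


def make_splits(rows, n_splits, train_fraction=0.8, seed=0):
    # Build the (train, test) pairs once so every k sees the same data.
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits


def mean_accuracy(k, splits):
    scores = []
    for train, test in splits:
        hits = sum(knn_label(k, train, t) == t[-1] for t in test)
        scores.append(hits / len(test) * 100)
    return statistics.mean(scores)


def best_k(rows, candidate_ks, n_splits=5):
    splits = make_splits(rows, n_splits)
    means = {k: mean_accuracy(k, splits) for k in candidate_ks}
    return max(means, key=means.get), means


# Invented data: two overlapping Gaussian blobs.
data_rng = random.Random(7)
toy_rows = (
    [[data_rng.gauss(0, 1), data_rng.gauss(0, 1), "B"] for _ in range(60)]
    + [[data_rng.gauss(2, 1), data_rng.gauss(2, 1), "M"] for _ in range(40)]
)

k_star, means = best_k(toy_rows, range(3, 12))
for k, acc in means.items():
    print(k, round(acc, 2))
print("best k:", k_star)
```

When several values of K tie on the mean, `max` returns the first one encountered, which here is the smallest K.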


## Summary:
We have now seen how kNN works in practice on real data. We also have some hints on how to find the “best” hyper-parameters by “internally testing” the performance on a small portion of the data that we have. We have also seen that there can be “statistical variation” in performance across different splits of the data, and that a more reliable way to measure performance is to take the average (classification accuracy in this case) across multiple splits.