## Classification
The purpose of classification is to train a machine on previously known data so that the machine can later identify the class of new data.  
<br>
For example, we'll work with breast tumor data and try to classify tumors as malignant or benign based on their attributes. To do this, we take previously known samples, use attributes such as the size and shape of the tumor as the features, and use the diagnosis (benign or malignant) as the label/class. From there, we can assess future tumors by the same attributes and predict whether each one is benign or malignant.

K Nearest Neighbors is a simple and effective machine learning classification algorithm.  
The way it works is in the name. K is a number you choose, and the neighbors are data points from the known data. We look for the K "nearest" neighbors of a new point and assign it the class that most of them share. With K = 3, for example, we look for the three closest neighboring points.
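
To make the idea concrete, here is a minimal from-scratch sketch of the voting step. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy 2-D points are made up for illustration and are not part of the dataset used later.

```python
from collections import Counter
import math

def knn_predict(known_points, known_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest known points."""
    # Euclidean distance from new_point to each known point, paired with its label
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(known_points, known_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (hypothetical, for illustration only)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

predicted = knn_predict(points, labels, (2, 2), k=3)
print(predicted)
```

The point (2, 2) sits next to the first cluster, so all three of its nearest neighbors carry the same label. In practice we'll let scikit-learn do this work, as shown in the rest of this notebook.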

## Dataset
The dataset can be downloaded <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/">here</a>.  
- *breast-cancer-wisconsin.data*: the actual dataset
- *breast-cancer-wisconsin.names*: the details about the data

Before we start programming, let's modify the *breast-cancer-wisconsin.data* file. Notice that it doesn't contain a header row describing what each column means. To treat it like any other CSV file, add a row at the top like this:  
`id,clump_thickness,uniform_cell_size,uniform_cell_shape,marginal_adhesion,single_epi_cell_size,bare_nuclei,bland_chromation,normal_nucleoli,mitoses,class`  
This gives each column a name, which makes the data meaningful once it is loaded into a DataFrame.
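
If you would rather not edit the downloaded file, pandas can attach the column names at load time instead. The sketch below assumes the same column names as the header row above and uses two made-up rows in place of the real file, so the values are illustrative only.

```python
import io
import pandas as pd

column_names = [
    "id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape",
    "marginal_adhesion", "single_epi_cell_size", "bare_nuclei",
    "bland_chromation", "normal_nucleoli", "mitoses", "class",
]

# Two made-up rows in the same headerless layout as the .data file
raw_text = "1001,5,1,1,1,2,1,3,1,1,2\n1002,8,10,10,8,7,?,9,7,1,4\n"

# header=None tells pandas the first row is data; names supplies the labels.
# With the real file, pass its path instead of the StringIO object.
sample_df = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)
print(sample_df.columns.tolist())
```

Either approach should end with the same named columns. The rest of this notebook assumes the header row was added to the file itself.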

Let's import the libraries we need:

In [1]:
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd

## Cleaning the data
Let's load the data and clean it.  
The *breast-cancer-wisconsin.names* file mentions that missing values are represented by *'?'*. Let's replace them with a custom value.  
Let's also remove the *'id'* column, since it carries no information about the class and will hurt training if it is kept.

In [2]:
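df = pd.read_csv('data/breast-cancer-wisconsin.data')
# Replace the '?' placeholders for missing values with an outlier value
df.replace('?', -99999, inplace=True)
# Drop the id column (the positional axis argument was removed in pandas 2.0)
df.drop(columns=['id'], inplace=True)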
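
Replacing missing values with a sentinel is one option; another is to discard the incomplete rows. The sketch below compares the two on a tiny made-up table (three hypothetical rows, only a few of the real columns). It relies on the `na_values` parameter of `read_csv` so that `'?'` is read as NaN; this is an alternative to the cell above, not what this notebook does.

```python
import io
import pandas as pd

# Three made-up rows with a header; the second has a missing bare_nuclei value
raw_text = (
    "id,clump_thickness,bare_nuclei,class\n"
    "1001,5,1,2\n"
    "1002,8,?,4\n"
    "1003,3,2,2\n"
)

# na_values='?' makes pandas read '?' as NaN instead of a string
toy_df = pd.read_csv(io.StringIO(raw_text), na_values="?")

# Option A: replace missing values with a sentinel (same effect as the cell above)
filled = toy_df.fillna(-99999).drop(columns=["id"])

# Option B: discard incomplete rows entirely
dropped = toy_df.dropna().drop(columns=["id"])

print(filled)
print(dropped)
```

Option A keeps every sample at the cost of an artificial outlier value; Option B keeps only complete rows and loses some data.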

## Features and Labels
Now, we'll define our features and label:

In [3]:
# Features: every column except the label
X = np.array(df.drop(columns=['class']))
# Label: the class column
y = np.array(df['class'])

## Splitting the dataset
Split the dataset into training and testing sets:

In [4]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
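
The split above is random, so it changes on every run. As a hedged sketch, `train_test_split` also accepts `random_state` to make the split repeatable and `stratify` to keep the class ratio equal in both parts. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn import model_selection

# Made-up data: 10 samples, 2 features, balanced labels 2 and 4
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([2, 4] * 5)

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # hold out 20% of the rows for testing
    random_state=42,    # fixed seed so the same split is produced on every run
    stratify=y_toy,     # keep the class ratio the same in both parts
)

print(X_tr.shape, X_te.shape)
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.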

## Defining the classifier
Now we will define the classifier, train it and test it. Because the train/test split is random, your accuracy will differ slightly from run to run:

In [5]:
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.9857142857142858
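
`KNeighborsClassifier()` uses `n_neighbors=5` by default. One way to choose K is to compare a few values with cross-validation. This is a rough sketch on synthetic stand-in data from `make_classification`, not the tumor dataset; the candidate values of K are an arbitrary choice for illustration.

```python
from sklearn import model_selection, neighbors
from sklearn.datasets import make_classification

# Synthetic stand-in data (not the tumor dataset): 300 samples, 9 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)

scores_by_k = {}
for k in (1, 3, 5, 7, 9):
    model = neighbors.KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores = model_selection.cross_val_score(model, X_demo, y_demo, cv=5)
    scores_by_k[k] = scores.mean()

# Pick the K with the highest mean accuracy
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

Odd values of K are a common choice for two-class problems because they avoid tied votes.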


Try commenting out the line above that drops the `id` column, then check the accuracy to see how much the id column degrades the model.  
<br>
The important point here is that **removing meaningless data is as important as selecting good features**.
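
To illustrate why an id-like column hurts K Nearest Neighbors, here is a sketch on synthetic data. It assumes well-separated clusters from `make_blobs` and a made-up column of large random integers. Because the classifier measures distance across all columns, the large, meaningless values dominate the distance and the real features stop mattering.

```python
import numpy as np
from sklearn import model_selection, neighbors
from sklearn.datasets import make_blobs

# Synthetic, well-separated two-class data standing in for the tumor features
X_demo, y_demo = make_blobs(n_samples=400, centers=2, n_features=9, random_state=0)

# A made-up id column: large random integers unrelated to the label
rng = np.random.default_rng(0)
fake_ids = rng.integers(100_000, 10_000_000, size=(400, 1))
X_with_id = np.hstack([fake_ids, X_demo])

def knn_accuracy(features, labels):
    """Train and score a default KNN classifier on a fixed 80/20 split."""
    X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
        features, labels, test_size=0.2, random_state=0
    )
    model = neighbors.KNeighborsClassifier()
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_clean = knn_accuracy(X_demo, y_demo)
acc_with_id = knn_accuracy(X_with_id, y_demo)
print(acc_clean, acc_with_id)
```

The exact numbers depend on the synthetic data, but the accuracy with the fake id column should be clearly lower than without it.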

## Let's predict!

In [6]:
test_data = np.array([[4,2,1,1,1,2,3,2,1]])
prediction = clf.predict(test_data)
print(prediction)

[2]
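
In this dataset, class 2 means benign and 4 means malignant, so the prediction above reads as benign. The sketch below shows how several samples can be classified at once and how the numeric codes can be mapped to names. To stay self-contained it trains a separate classifier (`toy_clf`) on a handful of made-up rows, so its output says nothing about real tumors; the `class_names` dictionary is a helper introduced here for illustration.

```python
import numpy as np
from sklearn import neighbors

# Made-up training rows (9 features each, values 1-10), not real patient data
X_toy = np.array([
    [1, 1, 1, 1, 2, 1, 2, 1, 1],
    [2, 1, 2, 1, 2, 1, 3, 1, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1],
    [10, 7, 7, 6, 4, 10, 4, 1, 2],
    [9, 8, 8, 7, 6, 9, 8, 6, 3],
])
y_toy = np.array([2, 2, 2, 4, 4, 4])

toy_clf = neighbors.KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(X_toy, y_toy)

# predict expects a 2-D array: one row per sample, one column per feature
new_samples = np.array([
    [4, 2, 1, 1, 1, 2, 3, 2, 1],
    [9, 9, 8, 7, 6, 10, 8, 7, 2],
])
codes = toy_clf.predict(new_samples)

# Translate the numeric class codes into readable names
class_names = {2: "benign", 4: "malignant"}
readable = [class_names[int(code)] for code in codes]
print(readable)
```

The same pattern works with the `clf` trained earlier: pass one row per sample, each with the nine feature values in the same column order used for training.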
