# kNN on Biomechanical Features of Orthopedic Patients
<br></br>
![orthopedic-img-rep](https://kaggle2.blob.core.windows.net/datasets-images/2374/3987/4a58a17df89fda0afe579dde6b7f25fa/dataset-cover.jpg)

## Contents
* [Introduction](#1)
* [Importing Libraries](#2)
* [Fetching Dataset](#3)
* [Data Munging](#4)
* [Correlations](#5)
* [k Nearest Neighbors](#6)
    * [What is kNN?](#7)
    * [How it Works?](#8)
    * [Overview of the Features](#9)
    * [k=3](#10)
    * [k=7](#11)
* [Learning with kNN](#12)
    * [Normalizing](#13)
    * [Test and Train Variables](#14)
    * [Initialize and Train the Classifier](#15)
    * [Best k Value](#16)
* [Conclusions](#17)


<a id=1></a>
## Introduction
In this kernel, I first explore the [Biomechanical features of orthopedic patients dataset](https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients) using basic data science techniques. I then explain the k-Nearest Neighbors (kNN) method, which is used here for learning, apply it to the dataset, and look at the results.

<a id=2></a>
## Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
print(os.listdir("../input"))

<a id=3></a>
## Fetching Dataset

In [None]:
data = pd.read_csv("../input/column_2C_weka.csv")

<a id=4></a>
## Data Munging

In [None]:
data.info()

The dataset consists of 310 entries, each representing an individual patient, and 7 attributes: six numeric biomechanical features and a final column (`class`) that is the target feature.  

Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (each one is a column):  

* `pelvic incidence`
* `pelvic tilt`
* `lumbar lordosis angle`
* `sacral slope`
* `pelvic radius`
* `grade of spondylolisthesis`

Let's take a glimpse of the data using the `head` method.

In [None]:
data.head()

Let's investigate our target feature (`class`) a bit more.

In [None]:
data['class'].value_counts()

Our label is an object-type (string) column. A numeric label is easier to work with (for example, when computing correlations), so let's convert it to a binary numeric type.  

After conversion:
* `Abnormal`  → `1`
* `Normal` → `0`

In [None]:
data['class'] = [1 if each == 'Abnormal' else 0 for each in data['class']]

In [None]:
data.head()
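
The same conversion can also be written with `Series.map`. The sketch below uses a small toy series as a stand-in for the `class` column and assumes the labels are exactly the strings `Abnormal` and `Normal`.

```python
import pandas as pd

# Toy stand-in for the `class` column (assumed to hold only these two strings).
labels = pd.Series(["Abnormal", "Normal", "Abnormal", "Abnormal"], name="class")

# Vectorized equivalent of the list comprehension used above.
encoded = labels.map({"Abnormal": 1, "Normal": 0})

print(encoded.tolist())  # [1, 0, 1, 1]
```

One difference worth noting: `map` yields `NaN` for any value missing from the dictionary, whereas the list comprehension silently turns every non-`Abnormal` value into `0`.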

<a id=5></a>
## Correlations

Now that the target feature is numeric, it's a perfect time to examine correlations.

In [None]:
f, ax = plt.subplots(figsize = (10,10))
sns.heatmap(data.corr(), annot=True, linewidths=.4, fmt= '.2f',ax=ax)
plt.show()

Correlations between `class` feature and other features:  

* `pelvic incidence` : `0.35`
* `pelvic tilt` : `0.33`
* `lumbar lordosis angle` : `0.31`
* `sacral slope` : `0.21`
* `pelvic radius` : `-0.31`
* `grade of spondylolisthesis` : `0.44`  

All of the features are weakly to moderately positively correlated with the label, except `pelvic radius`, which is negatively correlated (`-0.31`).
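
The label correlations can also be pulled out of the correlation matrix directly rather than read off the heatmap. This is a minimal sketch on synthetic data; the frame `toy` and its column names `feature_up` and `feature_down` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: two numeric features and a binary label.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "feature_up": label + rng.normal(0, 1, size=200),     # tends to rise with the label
    "feature_down": -label + rng.normal(0, 1, size=200),  # tends to fall with the label
    "class": label,
})

# Correlation of every feature with the label, sorted from most negative to most positive.
class_corr = toy.corr()["class"].drop("class").sort_values()
print(class_corr)
```

On the real data, the same expression with `data` in place of `toy` should give the column of the heatmap that belongs to `class`.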

To visualize the correlations in more detail, let's use `scatter_matrix`:

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(data, alpha = 0.8, figsize = (15,15))
plt.show()

<a id=6></a>
## k Nearest Neighbors
We will be using the method named **k Nearest Neighbors** for learning, so let's dive in to see what is under the hood!
<a id=7></a>
### What is kNN?
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.  

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. [Source](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
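
To make the majority vote concrete, here is a minimal from-scratch sketch. The function name `knn_predict`, the Euclidean distance, and the toy points are choices made for this illustration; this is not how `scikit-learn` implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    train_points = np.asarray(train_points, dtype=float)
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_points - np.asarray(query, dtype=float), axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, [0.5, 0.5], k=3))  # a
print(knn_predict(points, labels, [5.5, 5.5], k=1))  # b
```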

<a id=8></a>
### How it Works?
Let's go through a visual example to have the full understanding of it!

We have two classes of data in the dataset (`class 1` and `class 2`) and a test point that is not labeled yet. This is what they look like:
![knn_overview](http://i68.tinypic.com/2q2nx1w.jpg)
<a id=9></a>
### Overview of the Features
And our data looks like this: (ignore circles for now)
![knn](http://i67.tinypic.com/64nwib.jpg)
<a id=10></a>
### k=3
If we pick `k=3`, the test point is classified by its three nearest neighbors: 2 red squares (`class 1`) and 1 blue triangle (`class 2`). The majority is `class 1`, so the test point is classified as `class 1`, the red square.  
The diagram below visualizes this; the inner circle is the test space, i.e. the region containing the three nearest neighbors.
![knn3](http://i63.tinypic.com/2uz9jww.jpg)
<a id=11></a>
### k=7
Let's try `k=7` instead. Now there are 3 red squares and 4 blue triangles among the neighbors, so the blue triangles (`class 2`) win the vote.  
The outer circle is the new test space once we change `k` to `7`.
![knn7](http://i64.tinypic.com/2ut26tg.jpg)
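
The flip between `k=3` and `k=7` can be reproduced with a handful of points. The coordinates below are invented so that the neighbor counts match the pictures (2 red and 1 blue among the nearest three, 3 red and 4 blue among the nearest seven); they are not taken from the figures.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up coordinates; the test point sits at the origin.
points = np.array([
    [1.0, 0.0], [0.0, 1.5], [-3.0, 0.0],              # class 1 (red squares)
    [0.0, -2.0], [3.5, 0.0], [0.0, 4.0], [-4.5, 0.0],  # class 2 (blue triangles)
])
classes = np.array([1, 1, 1, 2, 2, 2, 2])
test_point = np.array([[0.0, 0.0]])

predictions = {}
for k in (3, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(points, classes)
    predictions[k] = int(model.predict(test_point)[0])

print(predictions)  # {3: 1, 7: 2}
```

The same test point gets a different label depending only on `k`, which is why the choice of `k` matters.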

<a id=12></a>
## Learning with kNN

Now we know what kNN means and how it works. Let's implement it on our dataset with the `scikit-learn` library.

First, create our `x` and `y` variables:  
* `y` →  target feature (label)
* `x` →  all the features used for training, i.e. everything except the label

In [None]:
y = data['class']
x = data.drop(['class'], axis = 1)

<a id=13></a>
### Normalizing
kNN relies on distances between points, so features with larger numeric ranges would otherwise dominate the result; normalizing puts every feature on the same scale. To scale all the values of a column between 0 and 1, we use the following simple formula:  
$$\large x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} $$  
So you can think of $\large53$ as $\large0.53$ after normalization, if the minimum is $\large0$ and the maximum is $\large100$ for a column.

In [None]:
# Min-max scale each column independently, using column-wise min and max
x = (x - x.min()) / (x.max() - x.min())
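
As a quick check of the formula, here is a sketch on a tiny made-up frame (`toy`, with columns `a` and `b`). Column `a` reproduces the 53 to 0.53 example from the text.

```python
import pandas as pd

# Made-up columns: `a` spans 0..100, `b` spans 10..30.
toy = pd.DataFrame({"a": [0, 53, 100], "b": [10, 20, 30]})

# Column-wise min-max scaling: each column uses its own min and max.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled["a"].tolist())  # [0.0, 0.53, 1.0]
print(scaled["b"].tolist())  # [0.0, 0.5, 1.0]
```

Every column ends up in the range 0 to 1 regardless of its original units.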

<a id=14></a>
### Test and Train Variables
We have to split our data into train and test sets. To do so, we will use `sklearn`'s `train_test_split` function.  

We set the `test_size` parameter to `0.2`, so a random 80% of the data is used for training. Here is what the four returned values correspond to:
* `x_train` : a random 80% of the rows, with the features of `x` (`pelvic_incidence`, `pelvic_tilt_numeric`, etc.)
* `x_test` : the remaining 20% of the rows, with the features of `x`
* `y_train` : the labels (`class`, the target feature) for the same rows as `x_train`
* `y_test` : the labels for the same rows as `x_test`  

Let's visualize what I mean:

![train_test](http://i64.tinypic.com/v5agox.jpg)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
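
To see what the split produces, the sketch below runs `train_test_split` on random stand-in data with the same shape as this dataset (310 rows, 6 features). The names `toy_x`, `toy_y`, and the `f0`..`f5` columns are placeholders, not the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the dataset: 310 rows, 6 features.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(rng.random((310, 6)), columns=[f"f{i}" for i in range(6)])
toy_y = pd.Series(rng.integers(0, 2, size=310), name="class")

x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, test_size=0.2, random_state=42)

print(x_tr.shape, x_te.shape)         # (248, 6) (62, 6)
print(x_tr.index.equals(y_tr.index))  # True: features and labels stay row-aligned
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same train and test rows.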

<a id=15></a>
### Initialize and Train the Classifier
Next, initialize the classifier object from `sklearn.neighbors`, then train it using the `fit` method.  
(The `n_neighbors` parameter, i.e. `k`, defaults to `5` if we do not set it explicitly.)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

knn.fit(x_train, y_train)
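
Once fitted, the classifier can be queried with `predict` and evaluated with `score`. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, so the accuracy it prints says nothing about the orthopedic dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 310 samples, 6 features, 2 classes.
features, target = make_classification(n_samples=310, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.2, random_state=42)

model = KNeighborsClassifier()  # n_neighbors is left at its default of 5
model.fit(x_tr, y_tr)

# `score` reports accuracy: the fraction of test rows predicted correctly.
predictions = model.predict(x_te)
manual_accuracy = np.mean(predictions == y_te)

print(model.n_neighbors)
print(manual_accuracy, model.score(x_te, y_te))
```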

<a id=16></a>
### Best k Value
Now let's create a list named `score_list` and store the test accuracy of a classifier trained for each `k` in `range(1, 15)`, i.e. `k` from 1 to 14.  
Then we visualize the accuracy values with a line plot.

In [None]:
score_list = []

# Train one classifier per k (1 to 14) and record its accuracy on the test set
for each in range(1, 15):
    knn_o = KNeighborsClassifier(n_neighbors=each)
    knn_o.fit(x_train, y_train)
    score_list.append(knn_o.score(x_test, y_test))

plt.figure(figsize = (10,10))
plt.plot(range(1,15), score_list)
plt.xlabel('k values')
plt.ylabel('accuracy')
plt.show()
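
Instead of reading the best `k` off the plot, it can be picked programmatically. This sketch again uses synthetic `make_classification` data as a stand-in, so the `k` it reports is not the result for the orthopedic dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 310 samples, 6 features, 2 classes.
features, target = make_classification(n_samples=310, n_features=6, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.2, random_state=42)

# Test accuracy for each k from 1 to 14.
k_values = list(range(1, 15))
scores = [
    KNeighborsClassifier(n_neighbors=k).fit(x_tr, y_tr).score(x_te, y_te)
    for k in k_values
]

# np.argmax returns the first maximum, so ties resolve to the smallest k.
best_k = k_values[int(np.argmax(scores))]
print(best_k, max(scores))
```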

<a id=17></a>
## Conclusions

* We get the highest accuracy score when we choose `k` as `5` or `7`.
* The highest score we could get is `0.79`, which is not a particularly good score.