### In this notebook, I use the "Biomechanical features of orthopedic patients" dataset.
* About this dataset:
    * There are 2 tasks, and I use the second one. In this task, the categories Disk Hernia and Spondylolisthesis are merged into a single category labelled 'abnormal'. Thus, the second task consists of classifying patients as belonging to one of two categories: Normal (100 patients) or Abnormal (210 patients).

    
<font color = 'red'>    
   
# Content:
    
1.  [Load and Check Data](#1)
2. [Exploratory Data Analysis (EDA)](#2)
3. [Normalization](#3)
4. [K-Nearest Neighbors (KNN)](#4)
  

<a id = '1'></a>
 # Load and Check Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv')
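print(plt.style.available) # look at available plot styles
# 'seaborn-dark' was renamed to 'seaborn-v0_8-dark' in newer matplotlib releases; use whichever exists
plt.style.use('seaborn-dark' if 'seaborn-dark' in plt.style.available else 'seaborn-v0_8-dark')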

<a id = '2'> </a>
# Exploratory Data Analysis (EDA)

In [None]:
# to see features and target variable

data.head()


In [None]:
# A common first question: are there any NaN values, and how long is the data? Let's look at info()
data.info()

* As you can see:
 
 * length: 310 (RangeIndex)
 * Features are float
 * The target variable has dtype object, which here means it holds strings
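
`info()` shows non-null counts per column, but a per-column count of missing values answers the NaN question more directly. A minimal sketch on a small made-up frame (the values below are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "pelvic_radius": [98.7, np.nan, 105.9],
    "class": ["Abnormal", "Normal", "Normal"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if any cell in the whole frame is NaN.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data the same two calls can be applied to `data` directly.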

In [None]:
data.describe()

* pd.plotting.scatter_matrix:
 
 * green: normal and blue: abnormal
 * c: color
 * figsize: figure size
 * diagonal: histogram of each feature
 * alpha: opacity
 * s: size of marker
 * marker: marker type

In [None]:
color_list = ['blue' if i=='Abnormal' else 'green' for i in data.loc[:,'class']]
pd.plotting.scatter_matrix(data.loc[:, data.columns != 'class'],
                                       c=color_list,
                                       figsize= [15,15],
                                       diagonal='hist',
                                       alpha=0.5,
                                       s = 200,
                                       marker = '*',
                                       edgecolor= "black")
plt.show()

* The Seaborn library has countplot(), which plots the number of samples in each class
* You can also print the counts with the value_counts() method
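
Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach. The sketch below uses only the class counts stated in the dataset description (100 Normal, 210 Abnormal) and computes the accuracy of always predicting the majority class; any classifier should be compared against this number.

```python
# Class counts as stated in the dataset description above.
n_normal = 100
n_abnormal = 210

# Accuracy of a classifier that always predicts the majority class ("Abnormal").
baseline_accuracy = n_abnormal / (n_normal + n_abnormal)
print(round(baseline_accuracy, 3))  # 0.677
```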


In [None]:
sns.countplot(x="class", data=data)
data.loc[:,'class'].value_counts()

In [None]:
A = data[data['class'] =='Abnormal']
N = data[data['class'] == "Normal"]

In [None]:
#scatter plot
plt.scatter(A.lumbar_lordosis_angle,A.pelvic_radius,color="blue",label="abnormal")
plt.scatter(N.lumbar_lordosis_angle,N.pelvic_radius,color="green",label="normal")
plt.xlabel("lumbar_lordosis_angle")
plt.ylabel("pelvic_radius")
plt.legend()
plt.show()

* We encode the classes as Abnormal = 1, Normal = 0

In [None]:
data['class'] = [1 if each == 'Abnormal' else 0 for each in data['class']]
y = data['class'].values
x_data = data.drop(["class"],axis=1)

In [None]:
data.tail()

<a id = '3'></a>
# Normalization

* Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
* We scale values between 0 and 1.

In [None]:


# Min-max scale each column to [0, 1] using that column's own minimum and maximum
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())

In [None]:
x.head()

* As you can see, our values now lie between 0 and 1.
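
To make the min-max formula concrete, here is a small sketch on an invented 4x2 array (the numbers are illustrative only). Each column is scaled with its own minimum and maximum, so features with very different ranges end up on the same 0 to 1 scale.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features with very different ranges (invented values).
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0],
              [40.0, 800.0]])

# Column-wise minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max scaling: (value - min) / (max - min), applied per column.
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, both columns hold the same values (0, 1/3, 2/3, 1), so neither feature dominates a distance computation simply because of its units.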

<a id = '4'></a>
# K-Nearest Neighbors (KNN)

* KNN: look at the K closest labeled data points and predict the most common class among them
* It is a classification method.
* First we need to train the model on our data. Train = fit
* fit(): fits (trains) the model on the data
* predict(): predicts labels for the given data
* x: features
* y: target variable (normal, abnormal)
* n_neighbors: K. In this example it is 3, which means we look at the 3 closest labeled data points
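
The idea in the list above can be sketched from scratch in a few lines. This is an illustrative toy, not how scikit-learn implements it: the function name `knn_predict` and the data points are made up, and it uses plain Euclidean distance (which is also the default metric of `KNeighborsClassifier`).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Invented 2-feature training set: three points of class 0, two of class 1.
X_train = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.8, 0.9], [0.9, 0.8]])
y_train = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([0.85, 0.85]), k=3)
print(pred_a, pred_b)  # 0 1
```

For the second query, two of the three nearest points belong to class 1 and one to class 0, so the vote is 2 to 1 in favor of class 1.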

# Train test split

In [None]:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state=1)

In [None]:
#knn model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3) # n_neighbors = k
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)

In [None]:
prediction

In [None]:
print(" {} knn score: {}".format(3,knn.score(x_test,y_test)))

# Find k value

In [None]:
#find k value
score_list = []
for each in range(1,15):
    knn2= KNeighborsClassifier(n_neighbors = each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))

plt.plot(range(1,15),score_list, color='brown')
plt.xlabel("k values")
plt.ylabel("accuracy")
plt.show()
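
Choosing k by its score on the test set means the test set is no longer an unbiased check of the final model. One common alternative is to pick k with cross-validation on the training data only. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the orthopedic dataset, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the orthopedic dataset): 200 samples, 6 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

k_values = range(1, 15)
cv_means = []
for k in k_values:
    # Mean accuracy over 5 cross-validation folds for this k.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_means.append(scores.mean())

# k with the highest mean cross-validated accuracy.
best_k = k_values[int(np.argmax(cv_means))]
print(best_k, max(cv_means))
```

In the notebook's setting, `X_demo` and `y_demo` would be replaced by `x_train` and `y_train`, leaving `x_test` untouched until the final evaluation.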

# Finding Model Complexity

In [None]:
# model complexity
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over different values of k
for i, k in enumerate(neig):
    # k runs from 1 to 24 (25 is excluded)
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit with knn
    knn.fit(x_train,y_train)
    #train accuracy
    train_accuracy.append(knn.score(x_train, y_train))
    # test accuracy
    test_accuracy.append(knn.score(x_test, y_test))

# Plot
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy', color = 'orange')
plt.plot(neig, train_accuracy, label = 'Training Accuracy', color= 'purple')
plt.legend()
plt.title('k-value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))
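
The `1 + index` lookup in the print statement works only because the k values start at 1 and increase by 1. A slightly more robust sketch, using invented accuracy values, indexes the array of k values directly with `np.argmax`:

```python
import numpy as np

# Invented example: k values that do not start at 1, with made-up accuracies.
neig_demo = np.arange(3, 8)                      # k = 3, 4, 5, 6, 7
test_accuracy_demo = [0.70, 0.82, 0.79, 0.82, 0.75]

# np.argmax returns the position of the first maximum.
best_idx = int(np.argmax(test_accuracy_demo))
best_k = int(neig_demo[best_idx])
print("Best accuracy is {} with K = {}".format(test_accuracy_demo[best_idx], best_k))
```

Here `1 + index` would report K = 2, which is not even among the tested values, while indexing `neig_demo` gives K = 4. When several k values tie, the smallest one is reported.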

In [None]:
# KNN with k = 3
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit on the normalized training split so the model matches the x_test used for scoring below
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))

In [None]:
prediction

In [None]:
# Report the k actually used by the fitted model instead of a hard-coded number
print(" {} knn score: {}".format(knn.n_neighbors,knn.score(x_test,y_test)))
