# Introduction

Hello people, welcome to my kernel! In this kernel I am going to compare different classification algorithms on the **Biomechanical Features of Orthopedic Patients** dataset.

In this kernel I am going to use 6 different algorithms:

* K-Nearest Neighbour
* Logistic Regression
* Support Vector Machine (SVC)
* Naive Bayes Classification
* Decision Tree Classification
* Random Forest Classification

Let's take a look at our schedule

# Schedule
1. Importing Libraries and Data
2. Having Idea About Data
1. Data Preprocessing for Machine Learning
1. Confusion Matrix Function
1. Machine Learning Algorithms
    * KNN
    * Logistic Regression
    * SVM Classification
    * Naive Bayes Classification
    * Decision Tree Classification
    * Random Forest Classification
1. Result
1. Conclusion


# Importing Libraries and Data

In this section I am only going to import the libraries for visualization and data processing. I will import the machine learning libraries when I need them.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import seaborn as sns
from matplotlib import pyplot as plt

import warnings as wrn
wrn.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing Data Using Pandas
data = pd.read_csv("/kaggle/input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv")

I've imported the two-class version of the data (`column_2C_weka.csv`) because I want to use Logistic Regression.

# Having Idea About Data

In this section I am going to examine the dataset, because I need to have an idea about the data before preprocessing. In order to do this I am going to use the `head()`, `tail()` and `info()` methods.

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

* Our data does not have any NaN values, so there is nothing to fill.
* Our label column is of type object, so we have to convert it to int64.
* There are only 6 features in our dataset.
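
The checks behind the points above can also be done explicitly instead of reading them off `info()`. The following is only a minimal sketch: the `toy` frame and its `pelvic_incidence` column are hypothetical stand-ins for the real CSV, which the notebook loads into `data`.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself works on the loaded `data` frame.
toy = pd.DataFrame({
    "pelvic_incidence": [63.0, 39.1, 68.8, 44.5],
    "class": ["Abnormal", "Abnormal", "Normal", "Normal"],
})

# Number of missing values per column (all zeros means there is nothing to fill).
nan_counts = toy.isna().sum()

# Number of rows per label, which shows how balanced the two classes are.
class_counts = toy["class"].value_counts()

print(nan_counts)
print(class_counts)
```

Running the same two calls on `data` would show the NaN counts and the class balance of the real dataset.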

# Data Preprocessing for Machine Learning

In this section I am going to prepare data for machine learning. In order to do this I am going to follow these steps.

* Converting Label to Int64
* Normalizing Data
* Splitting Data 

Let's start.


## Converting Label to Int64

As you may remember, there are two labels in our dataset. They are:
* Normal
* Abnormal

I am going to convert *Normal* to 1 and *Abnormal* to 0. In order to do this I am going to use a list comprehension.

(Python is a really useful language :D )

In [None]:
data["class"] = [1 if each == "Normal" else 0 for each in data["class"]] 

In [None]:
data.head()

In [None]:
data.tail()
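
As an alternative to the list comprehension, the same conversion can be written with pandas' `Series.map`. This is only a sketch on a hypothetical `labels` series, not the code this kernel uses. One difference worth knowing: a label that is missing from the mapping becomes NaN here, whereas the list comprehension would silently turn it into 0.

```python
import pandas as pd

# Hypothetical label column with the same two values as the dataset.
labels = pd.Series(["Normal", "Abnormal", "Abnormal", "Normal"], name="class")

# Explicit mapping from label text to integer code.
encoded = labels.map({"Normal": 1, "Abnormal": 0})

print(encoded.tolist())
```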

## Normalizing Data

In this section I am going to normalize data. In order to do this I am going to use this formula.

#### normalized = (value - min value of the feature) / (max value of the feature - min value of the feature)

Let's implement this in python!

In [None]:
# Column-wise min-max scaling: data.min() and data.max() return one value per column
data = (data - data.min()) / (data.max() - data.min())

In [None]:
data.head()
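
To make the formula concrete, here is a small worked sketch on a hypothetical two-column frame (`toy`, with made-up values). It applies the formula by hand and compares the result with scikit-learn's `MinMaxScaler`, which performs the same column-wise scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame with two features on very different scales.
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 20.0, 40.0]})

# Manual min-max scaling: one min and one max per column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation through scikit-learn.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual)
print(np.allclose(manual.values, scaled.values))
```

Every column ends up in the range 0 to 1, so no single feature dominates distance-based models such as KNN.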


## Splitting Data

In this section I am going to split the data into two pieces: train and test. In order to do this I am going to use scikit-learn's `train_test_split` function. Let's do this!

In [None]:
from sklearn.model_selection import train_test_split
x = data.drop("class",axis=1)
y = data["class"]

x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=1,test_size=0.3)

- Our train and test splits are ready, so we are ready to train some machine learning models!
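
One optional variation, shown here only as a sketch on hypothetical random data: passing `stratify` to `train_test_split` keeps the class ratio roughly the same in the train and test pieces. The split in this kernel does not use `stratify`; the names `x_toy`, `y_toy` and the toy class counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 6 features, 14 rows of class 0 and 6 of class 1.
rng = np.random.default_rng(1)
x_toy = rng.random((20, 6))
y_toy = np.array([0] * 14 + [1] * 6)

# stratify=y_toy keeps the class ratio roughly equal in both pieces.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(x_tr.shape, x_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```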

# Confusion Matrix Function

In this section I am going to define a function that creates a seaborn heatmap from a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
def plot_confusionMatrix(y_true,y_pred):
    cn = confusion_matrix(y_true=y_true,y_pred=y_pred)
    
    fig,ax = plt.subplots(figsize=(5,5))
    sns.heatmap(cn,annot=True,linewidths=1.5)
    plt.show()
    return cn

Our function is ready
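
The four cells of a binary confusion matrix can also be turned into numbers such as accuracy, precision and recall. The sketch below uses hypothetical label lists (`true_labels`, `pred_labels`), not this kernel's predictions. For labels 0 and 1, scikit-learn puts the true classes on the rows and the predicted classes on the columns, so `ravel()` yields the cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = Normal, 0 = Abnormal).
true_labels = [1, 0, 0, 1, 0, 1, 0, 0]
pred_labels = [1, 0, 1, 1, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted 1s that are really 1
recall = tp / (tp + fn)      # share of real 1s that were found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```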

# Machine Learning Algorithms

Finally, we've come to our main section. In this section I am going to train several machine learning algorithms, and at the end of the section I am going to compare their accuracy scores. Let's start with KNN.

## KNN Classification

In this section I am going to train a KNN model using the sklearn library. Then I am going to save the score of this model in a dict, because I want to compare the scores at the end.

In [None]:
score_list = {} # I've created this dict for saving score variables into it 

In [None]:
from sklearn.neighbors import KNeighborsClassifier 
KNN = KNeighborsClassifier(n_neighbors=22) #I've tried more than 50 values. 22 is the best value

KNN.fit(x_train,y_train)
knn_score = KNN.score(x_test,y_test)
score_list["KNN Classifier"] = knn_score
print(f"Score is {knn_score}")


Our first algorithm's score is 81%. I think it is a bit low, but not bad.

In [None]:
y_true = y_test
y_pred = KNN.predict(x_test)
plot_confusionMatrix(y_true,y_pred)
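
The value of `n_neighbors` above was picked by trying many candidates. One common way to automate such a search, sketched here under assumptions, is to score each candidate with cross-validation on the training data only, so that the test set stays untouched until the final evaluation. The synthetic data from `make_classification` and the candidate range are hypothetical; this is not necessarily how the value 22 was found.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for x_train / y_train.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Mean 5-fold cross-validation accuracy for each candidate k.
cv_means = {}
for k in range(1, 31, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(model, X, y, cv=5).mean()

# Pick the k with the highest mean accuracy.
best_k = max(cv_means, key=cv_means.get)
print(best_k, cv_means[best_k])
```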

## Logistic Regression 

In this section I am going to train a logistic regression model. I hope it will be better than KNN. 

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(x_train,y_train)

lr_score = LR.score(x_test,y_test)
score_list["Logistic Regression"] = lr_score

print(f"Score is {lr_score}")

* Our logistic regression score is 74%.

In [None]:
y_pred = LR.predict(x_test)
plot_confusionMatrix(y_true,y_pred)

## Support Vector Machine Classification
In this section I am going to use SVM classification.

In [None]:
from sklearn.svm import SVC 

svc = SVC()
svc.fit(x_train,y_train)
svc_score = svc.score(x_test,y_test)
score_list["SVC"] = svc_score
print(f"Score is {svc_score}")

* Our SVC score is 80%. This is better than the Logistic Regression score.

In [None]:
y_true = y_test
y_pred = svc.predict(x_test)
plot_confusionMatrix(y_true,y_pred)

## Naive Bayes Classification

In this section I am going to train a Naive Bayes classification (NBC) model.

In [None]:
from sklearn.naive_bayes import GaussianNB

nbc = GaussianNB()
nbc.fit(x_train,y_train)
nbc_score = nbc.score(x_test,y_test)
score_list["GaussianNBC"] = nbc_score

print(f"Score is {nbc_score}")

* Our score is 81%. It is very similar to our KNN score.

In [None]:
y_true = y_test
y_pred = nbc.predict(x_test)
plot_confusionMatrix(y_true,y_pred)

## Decision Tree Classification

In this section I am going to train a decision tree classification model.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=1)
dtc.fit(x_train,y_train)

dtc_score = dtc.score(x_test,y_test)
score_list["DTC"] = dtc_score
print(f"Score is {dtc_score}")

Our score is 78%. Although our score is low, it is still better than the Logistic Regression score.

In [None]:
y_true = y_test
y_pred = dtc.predict(x_test)
plot_confusionMatrix(y_true,y_pred)

## Random Forest Classification

In this section I am going to train our last model, Random Forest Classification.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=50,random_state=1)
rfc.fit(x_train,y_train)
rfc_score = rfc.score(x_test,y_test)
score_list["RFC"]=rfc_score

print(f"Score is {rfc_score}")

* Finally! Our score is 87%. It is the best score of this kernel.
* Our algorithms are finished. We are ready to compare the scores.

In [None]:
y_true = y_test
y_pred = rfc.predict(x_test)
plot_confusionMatrix(y_true,y_pred)

# Result

In this section I am going to compare the scores of all the algorithms. At the start of the Machine Learning section, I defined a dictionary that contains the scores of the algorithms. In this section I am going to use it.

In [None]:
score_list = list(score_list.items())

In [None]:
for alg,score in score_list:
    print(f"{alg} Score is {str(score)[:4]} ")


We've seen our scores. Let's sort them.

1. Random Forest Classification: 87% accuracy
2. Naive Bayes and KNN Classification: 81% accuracy
3. Support Vector Machine Classification: 80% accuracy
4. Decision Tree Classification: 78% accuracy
5. Logistic Regression: 74% accuracy
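
The ranking above can also be produced in code rather than by hand. This is a small sketch; the `scores` dict below is filled with the rounded accuracies reported in this kernel, whereas a real run would use the dict built during training.

```python
# Rounded accuracies as reported above, keyed like the kernel's score dict.
scores = {
    "KNN Classifier": 0.81,
    "Logistic Regression": 0.74,
    "SVC": 0.80,
    "GaussianNBC": 0.81,
    "DTC": 0.78,
    "RFC": 0.87,
}

# Sort (name, score) pairs from the highest score to the lowest.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.0%}")
```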

As we can see, the best algorithm for this classification task is Random Forest Classification.

## But do not forget that on a different dataset, a different algorithm may come out on top

## So this does not show that Random Forest Classification is the best algorithm for classification in general; it is just the best one for this dataset :)

# Conclusion

I am a beginner in Machine Learning, so I might have made some mistakes. If there is a problem, please contact me.

However, if there are no problems, I would be glad if you upvoted this kernel.