# INTRODUCTION

## What is Orthopaedic Biomechanics?

Orthopaedic biomechanics is about discovering and potentially optimizing the mechanical stresses experienced by normal, diseased, injured, or surgically treated bones, joints, and soft tissues.

This subfield of study is particularly influenced by two groups of specialists: orthopaedic surgeons and biomechanical engineers. Orthopaedic surgeons are on the “clinical frontline,” as they treat patients by performing procedures like total or partial joint replacement, bone fracture repair, soft tissue repair, limb deformity correction, and bone tumor removal. Biomechanical engineers are on the “technological frontline,” as they discover the basic mechanical properties of human tissues, design and test the structural stress limits of orthopaedic implants, and develop new and improved biological and artificial biomaterials. Consequently, the strategy for conducting cutting-edge experimental research in orthopaedic biomechanics in hospitals, universities, and industry includes a combination of orthopaedic surgery, mechanical testing, and medical imaging.

Contents:

1. [Load and Check Data](#1)
2. [Variable Description](#2)
3. [Univariate Variable Analysis](#3)
    * [Numerical Variable](#4)
4. [Outlier Detection](#5)
5. [Missing Value](#6)
    * [Find Missing Value](#7)
6. [Visualization](#8)
    * [Correlation Between Features](#9)
7. [Modeling](#10)
    * [Train-Test Split](#11)
    * [Simple Logistic Regression](#12)
    * [KNN Classification](#13)
    * [K-Fold Cross Validation](#14)
    * [Grid Search Cross Validation with Logistic Regression](#15)
    * [Grid Search Cross Validation with KNN](#16)
    * [Ensemble Modeling](#17)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# matplotlib
import matplotlib.pyplot as plt

# seaborn
import seaborn as sns

#plotly
import plotly.io as pio
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected = True)
import plotly.graph_objs as go

from collections import Counter

import warnings
warnings.filterwarnings("ignore")


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a>
# Load and Check Data

In [None]:
data_2c = pd.read_csv("/kaggle/input/biomechanical-features-of-orthopedic-patients/column_2C_weka.csv")
data_2c.head()

In [None]:
data_2c.info()

In [None]:
data_2c.tail()

In [None]:
data_2c.columns

In [None]:
g = sns.pairplot(data_2c,hue = "class",palette = "husl")

<a id = "2"></a>
# Variable Description

1. Pelvic Incidence : Pelvic incidence is defined as the angle between a line perpendicular to the sacral plate at its midpoint and a line connecting this point to the femoral head axis.

2. Pelvic Tilt Numeric : Pelvic tilt is the orientation of the pelvis with respect to the thighbones and the rest of the body.

3. Lumbar Lordosis Angle : The lumbar lordosis angle (LLA) is an ideal parameter for the evaluation of lumbar lordosis. The normal value of LLA can be defined as 20-45 degrees with a range of 1 SD.

4. Sacral Slope : The sacral slope (SS) is the angle of the sacral plateau to the horizontal. The degree of the sacral slope determines the position of the lumbar spine, since the sacral plateau forms the base of the spine.

5. Pelvic Radius

6. Degree Spondylolisthesis : Spondylolisthesis can be described according to its degree of severity. One commonly used description grades spondylolisthesis, with grade 1 being least advanced, and grade 5 being most advanced. The spondylolisthesis is graded by measuring how much of a vertebral body has slipped forward over the body beneath it.

7. Class : Abnormal or normal.

In [None]:
from IPython.display import Image
Image("../input/pelvicimage1/pelvic2.jpg")

The image shows where each of these features is measured on the spine and the sacrum.

<a id = "3"></a>
# Univariate Variable Analysis

In [None]:
data_2c.info()

* Categorical Variable : Class
* Numerical Variable : pelvic_incidence, pelvic_tilt numeric, lumbar_lordosis_angle, sacral_slope, pelvic_radius, degree_spondylolisthesis

<a id = "4"></a>
## Numerical Variable

In [None]:
def hist_plot(variable):
    plt.figure(figsize = (9,4))
    plt.hist(data_2c[variable],bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution".format(variable))
    plt.show()

In [None]:
numvar = ["pelvic_incidence", "pelvic_tilt numeric", "lumbar_lordosis_angle", "sacral_slope", "pelvic_radius", "degree_spondylolisthesis"]
for n in numvar:
    hist_plot(n)

The histograms show how the values of each numerical feature in the data are distributed.

<a id = "5"></a>
# Outlier Detection

In [None]:
f,ax = plt.subplots(figsize = (8,8))
sns.boxplot(data=data_2c, orient="h", palette="Set2")
plt.show()

First, the boxplot gives a visual impression of which values are outliers.

Then we write a function to detect outliers.

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outlier and their indexes
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        
        # store indexes
        outlier_indices.extend(outlier_list_col)
    
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i,v in outlier_indices.items() if v > 1)
    
    return multiple_outliers

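For each feature, this function flags the rows whose value lies more than 1.5 × IQR below the first quartile or above the third quartile. It then returns only the rows that are flagged as outliers in more than one feature.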
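
To make the rule concrete, here is a minimal sketch of the same idea in plain Python on made-up numbers. The toy columns and the helper name `iqr_outlier_indices` are assumptions for illustration only; they are not part of the dataset or of the notebook's code.

```python
from collections import Counter
from statistics import quantiles


def iqr_outlier_indices(values):
    """Return positions whose value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # method="inclusive" interpolates linearly, like np.percentile's default
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    return [i for i, v in enumerate(values) if v < q1 - step or v > q3 + step]


# Hypothetical toy data: three features measured on the same six rows
toy = {
    "feature_a": [10, 12, 11, 13, 12, 90],  # row 5 is extreme
    "feature_b": [5, 6, 5, 7, 6, 40],       # row 5 is extreme here too
    "feature_c": [1, 2, 50, 2, 1, 2],       # row 2 is extreme only here
}

# Count in how many features each row is flagged
counts = Counter()
for column in toy.values():
    counts.update(iqr_outlier_indices(column))

# Keep only rows flagged in more than one feature, as detect_outliers does
multiple_outliers = [row for row, n in counts.items() if n > 1]
print(counts)
print(multiple_outliers)
```

In this toy example row 5 is flagged in two features and is kept in the result, while row 2 is flagged in only one feature and is therefore not reported.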

In [None]:
data_2c.loc[detect_outliers(data_2c,["pelvic_incidence", "pelvic_tilt numeric", "lumbar_lordosis_angle", "sacral_slope", "pelvic_radius", "degree_spondylolisthesis"])]

After detecting the outliers, we can drop them from the data.

In [None]:
#drop outliers
data_2c = data_2c.drop(detect_outliers(data_2c,["pelvic_incidence", "pelvic_tilt numeric", "lumbar_lordosis_angle", "sacral_slope", "pelvic_radius", "degree_spondylolisthesis"]),axis = 0).reset_index(drop = True)

In [None]:
data_2c.info()

<a id = "6"></a>
# Missing Value

After outlier detection, we can search the data for missing values.

<a id = "7"></a>
## Find Missing Value

In [None]:
data_2c.columns[data_2c.isnull().any()]

In [None]:
data_2c.isnull().sum()

As you can see, there are no missing values in the data.

Now we can visualize the data.

<a id = "8"></a>
# Visualization

In [None]:
data_2c.head()

<a id = "9"></a>
## Correlation Between Features

In [None]:
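# corr() needs numeric columns only; "class" is still a string column here (pandas 1.5+)
corr = data_2c.corr(numeric_only=True)
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the built-in bool
mask[np.triu_indices_from(mask)] = True  # hide the redundant upper triangle

f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)

sns.heatmap(corr,linewidths=0.25,vmax=0.7,square=True,cmap="RdBu", #"BuGn_r" to reverse 
            linecolor='w',annot=True,annot_kws={"size":12},mask=mask,cbar_kws={"shrink": .9});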

From the heatmap we can say that:

- Pelvic Incidence has a positive correlation with pelvic tilt numeric
- Pelvic Incidence has a positive correlation with lumbar lordosis angle
- Pelvic Incidence has a positive correlation with sacral slope
- Pelvic Incidence has a negative correlation with pelvic radius
- Pelvic Incidence has a positive correlation with degree spondylolisthesis
- Pelvic Tilt Numeric has a positive correlation with lumbar lordosis angle
- Pelvic Tilt Numeric has a positive correlation with degree spondylolisthesis
- Lumbar Lordosis Angle has a positive correlation with sacral slope
- Lumbar Lordosis Angle has a positive correlation with degree spondylolisthesis
- Lumbar Lordosis Angle has a negative correlation with pelvic radius
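
For readers who want to see what a single cell of the correlation matrix measures, the following is a rough sketch of the Pearson coefficient in plain Python. The function name `pearson_r` and the sample values are assumptions made up for this illustration, not values from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, not taken from the dataset
rising = [1.0, 2.0, 3.0, 4.0]
r_positive = pearson_r(rising, [2.0, 4.0, 6.0, 8.0])  # both increase together
r_negative = pearson_r(rising, [8.0, 6.0, 4.0, 2.0])  # one falls as the other rises
print(r_positive, r_negative)
```

A coefficient near +1 means the two features rise together, and a coefficient near -1 means one falls as the other rises, which is how the positive and negative entries in the list above should be read.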

In [None]:
f,ax = plt.subplots(figsize = (12,12))
data_2c_melt = pd.melt(data_2c,"class",var_name = "measurement")
sns.swarmplot(x="measurement", y="value", hue="class",
              palette=["r", "c", "y"], data=data_2c_melt)
plt.show()

<a id = "10"></a>
# Modeling

Now we will use ML algorithms to predict whether a patient's features are normal or abnormal.

We will work through the following five approaches:

- Simple Logistic Regression
- KNN Classification
- K-Fold Cross Validation
- Grid Search Cross Validation with Logistic Regression
- Grid Search Cross Validation with KNN

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id = "11"></a>
## Train-Test Split

In [None]:
data_2c.head()

In [None]:
data_2c.tail()

In [None]:
data_2c["class"] = [1 if i == "Normal" else 0 for i in data_2c["class"]]

The "Normal" and "Abnormal" labels have been converted to 1 and 0 so that the classifiers can work with numeric targets.

In [None]:
data_2c.head()

In [None]:
data_2c.tail()

In [None]:
y = data_2c["class"]
x_data = data_2c.drop(["class"],axis = 1)

In [None]:
y

In [None]:
# normalization (min-max scaling of each column to the range [0, 1])
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
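
The normalization step is min-max scaling: each value is shifted by the column minimum and divided by the column range. As a small sketch on made-up numbers, the plain-Python helper below does the same for one column. The name `min_max_scale` and the guard for a constant column are assumptions added for illustration; the notebook's own expression has no such guard.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(column), max(column)
    span = high - low
    if span == 0:
        # Constant column: avoid dividing by zero (an assumption of this sketch)
        return [0.0 for _ in column]
    return [(v - low) / span for v in column]


# Hypothetical angles in degrees
scaled = min_max_scale([20.0, 35.0, 50.0])
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between, so features measured on different scales contribute comparably to distance-based models such as KNN.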

In [None]:
# train - test split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3)

print("x_train",len(x_train))
print("x_test",len(x_test))
print("y_train",len(y_train))
print("y_test",len(y_test))

In [None]:
x_train

In [None]:
x_test

In [None]:
y_train

In [None]:
y_test

<a id = "12"></a>
## Simple Logistic Regression

In [None]:
logisticreg = LogisticRegression()
logisticreg.fit(x_train,y_train)

acc_log_train = round(logisticreg.score(x_train,y_train)*100,2)
acc_log_test = round(logisticreg.score(x_test,y_test)*100,2)
print("Training = Accuracy : % {}".format(acc_log_train))
print("Testing = Accuracy : % {}".format(acc_log_test))
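
As a rough picture of what the fitted model does at prediction time, logistic regression passes a weighted sum of the features through the sigmoid function and thresholds the result at 0.5. The weights, bias, and feature values below are hypothetical numbers chosen for illustration, not coefficients learned from this dataset.

```python
from math import exp


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(features, weights, bias):
    """Probability of class 1 for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and one hypothetical, already-normalized sample
weights = [2.0, -1.0]
bias = -0.5
p = predict_proba([0.9, 0.2], weights, bias)

# Class 1 is predicted when the probability reaches 0.5
label = 1 if p >= 0.5 else 0
print(p, label)
```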

<a id = "13"></a>
## KNN Classification

We start the classification with K = 3.

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)

In [None]:
prediction

In [None]:
print("{} nn score : {}".format(3,knn.score(x_test,y_test)))
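
To illustrate the idea behind `KNeighborsClassifier`, here is a minimal sketch of a K-nearest-neighbours prediction in plain Python: find the K training points closest to the query and take a majority vote. The function `knn_predict` and the toy points are assumptions for illustration; scikit-learn's implementation is far more efficient and may break ties differently.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Order the training rows by Euclidean distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-normalized points labelled 1 (normal) or 0 (abnormal)
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.8, 0.9), (0.9, 0.8)]
labels = [1, 1, 1, 0, 0]

near_normal = knn_predict(points, labels, (0.12, 0.18))
near_abnormal = knn_predict(points, labels, (0.85, 0.85))
print(near_normal, near_abnormal)
```

In the second query two of the three nearest neighbours carry label 0, so the vote goes to class 0 even though the third neighbour belongs to class 1.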

In [None]:
# find k value
score_list = []
for each in range(1,15):
    knn2 = KNeighborsClassifier(n_neighbors=each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))
    
plt.plot(range(1,15),score_list)
plt.xlabel("k values")
plt.ylabel("accuracy")
plt.show()

The plot shows the test accuracy for each value of K, so the K values that give the best accuracy can be read off directly.

<a id = "14"></a>
## K-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=3)
accuracies = cross_val_score(estimator=knn,X = x_train,y = y_train,cv = 10)

In [None]:
accuracies

In [None]:
print("average accuracy : ",np.mean(accuracies))
print("std of accuracies : ",np.std(accuracies))

In [None]:
# test
knn.fit(x_train,y_train)
print("test accuracy : ",knn.score(x_test,y_test))
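
The following sketch shows how K-fold cross validation partitions the training rows: every row is used for validation exactly once and for training in all other folds. The generator `k_fold_indices` is a hypothetical helper written for this illustration. It does not shuffle or stratify, whereas scikit-learn stratifies by class when an integer `cv` is passed for a classifier.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten hypothetical rows split into five folds
folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train, validation in folds:
    print("validate on", validation, "train on", train)
```

The model is fitted once per fold on the training indices and scored on the validation indices, and the mean and standard deviation of those scores are what the cells above report.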

<a id = "15"></a>
## Grid Search Cross Validation with Logistic Regression

In [None]:
param_grid = {"C" : np.logspace(-3,3,7),"penalty" : ["l1","l2"]} # l1= lasso  l2 = ridge
# liblinear supports both penalties; the default lbfgs solver rejects "l1"
logisticreg = LogisticRegression(solver = "liblinear")
logisticreg_cv = GridSearchCV(logisticreg,param_grid,cv = 10)
logisticreg_cv.fit(x_train,y_train)

In [None]:
print("tuned hyperparameters : (best parameters) :",logisticreg_cv.best_params_)

In [None]:
print("accuracy : ",logisticreg_cv.best_score_)

In [None]:
logisticreg2 = LogisticRegression(C = 100.0,penalty = "l2")
logisticreg2.fit(x_train,y_train)
print("score : ",logisticreg2.score(x_test,y_test))
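
A grid search simply tries every combination of the listed hyperparameter values and cross-validates each one. As a small sketch of how many candidates the grid for logistic regression produces, the snippet below enumerates it with the standard library. The `C` values are written as powers of ten to mirror `np.logspace(-3,3,7)`, and the variable names are assumptions for this illustration.

```python
from itertools import product

# The same grid written without NumPy: C = 10^-3 ... 10^3, two penalties
param_grid = {
    "C": [10.0 ** e for e in range(-3, 4)],
    "penalty": ["l1", "l2"],
}

# Every combination of one C value with one penalty
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_folds = 10
n_cv_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), n_cv_fits)
print(candidates[0])
```

The candidate with the best mean validation score is the one reported in `best_params_`.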

<a id = "16"></a>
## Grid Search Cross Validation with KNN

In [None]:
grid = {"n_neighbors" : np.arange(1,50)}
knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn,grid,cv = 10)
knn_cv.fit(x,y)

In [None]:
# print hyperparameter => K value in KNN algorithm
print("tuned hyperparameter K : ",knn_cv.best_params_)
print("accuracy according to tuned parameter : ", knn_cv.best_score_)

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
              SVC(random_state = random_state),
              RandomForestClassifier(random_state = random_state),
              LogisticRegression(random_state = random_state,solver = "liblinear"), # liblinear accepts the "l1" penalty in the grid
              KNeighborsClassifier()]

dt_param_grid = {"min_samples_split" : range(10,500,20),
                 "max_depth" : range(1,20,2)}

svc_param_grid = {"kernel" : ["rbf"],
                  "gamma" : [0.001,0.01,0.1,1],
                  "C" : [1,10,50,100,200,300,1000]}

rf_param_grid = {"max_features" : [1,3,6], # at most the 6 features available in the data
                 "min_samples_split" : [2,3,10],
                 "min_samples_leaf" : [1,3,10],
                 "bootstrap" : [False],
                 "n_estimators" : [100,300],
                 "criterion" : ["gini"]}

logreg_param_grid = {"C" : np.logspace(-3,3,7),
                     "penalty" : ["l1","l2"]}

knn_param_grid = {"n_neighbors" : np.linspace(1,19,10,dtype = int).tolist(),
                  "weights" : ["uniform","distance"],
                  "metric" : ["euclidean","manhattan"]}

classifier_param = [dt_param_grid,svc_param_grid,rf_param_grid,logreg_param_grid,knn_param_grid]

In [None]:
cv_result = []
best_estimators = []

for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid = classifier_param[i],cv = StratifiedKFold(n_splits = 10),scoring = "accuracy",n_jobs = -1,verbose = 1)
    clf.fit(x_train,y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])

In [None]:
cv_results = pd.DataFrame({"Cross Validation Means": cv_result,"ML Models" : ["DecisionTreeClassifier","SVM","RandomForestClassifier",
                                                                              "LogisticRegression","KNeighborsClassifier"]})
g  = sns.barplot(x = "Cross Validation Means",y = "ML Models",data = cv_results) # recent seaborn versions require keyword arguments
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")
plt.show()

As the plot shows, the Logistic Regression model gives the best mean cross-validation accuracy for this data.

<a id = "17"></a>
## Ensemble Modeling

In [None]:
votingC = VotingClassifier(estimators = [("dt",best_estimators[0]),
                                         ("rfc",best_estimators[2]),
                                         ("lr",best_estimators[3])],
                                         voting = "soft",n_jobs = -1)
votingC = votingC.fit(x_train,y_train)
print(accuracy_score(y_test,votingC.predict(x_test)))
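
With `voting = "soft"`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below illustrates this on made-up probabilities. The function `soft_vote` and the numbers are assumptions for illustration and do not come from the fitted models.

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models and return (winner, averages)."""
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averages = [sum(p[c] for p in probabilities_per_model) / n_models
                for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: averages[c])
    return winner, averages


# Hypothetical [P(class 0), P(class 1)] outputs of three models for one patient
model_outputs = [
    [0.55, 0.45],  # model 1 leans slightly towards class 0
    [0.60, 0.40],  # model 2 leans slightly towards class 0
    [0.10, 0.90],  # model 3 is confident in class 1
]

winner, averages = soft_vote(model_outputs)
print(winner, averages)
```

Here a simple majority vote would choose class 0, but the averaged probabilities favour class 1 because the third model is much more confident than the other two.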

In the end, the soft-voting ensemble makes predictions with an accuracy of about 0.88 on the test set for the data used here.