<h1>Lower Back Pain Classification Algorithm </h1>

<p>This dataset contains anthropometric measurements of spinal curvature, which serve as the features for classifying each case as normal or abnormal.
<br />
Lower back pain affects around 80% of individuals at some point in their lives. If this model becomes robust enough, these measurements could become predictive measures that help inform treatment.
<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.4845&rep=rep1&type=pdf">This study</a> supports the validity of manual goniometer measurements as a clinical tool. </p>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# read data into dataset variable
data = pd.read_csv("../input/Dataset_spine.csv")

# Drop the unnamed column in place (modifies data rather than returning a copy)
data.drop('Unnamed: 13', axis=1, inplace=True)

# Concatenate the original df with the dummy variables for the class label
data = pd.concat([data, pd.get_dummies(data['Class_att'])], axis=1)

# Drop the original label column and the redundant 'Normal' dummy in place,
# leaving 'Abnormal' as the binary target (1 = Abnormal, 0 = Normal)
data.drop(['Class_att','Normal'], axis=1, inplace=True)
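
<p>To make the dummy-variable step concrete, here is a minimal sketch on a toy frame. The column <code>pelvic_tilt</code> and all values are made up for illustration; only <code>Class_att</code> and its two labels come from the notebook.</p>

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "pelvic_tilt": [12.3, 20.1, 9.8, 15.4],
    "Class_att": ["Abnormal", "Normal", "Abnormal", "Normal"],
})

# One indicator column per class; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(toy["Class_att"], dtype=int)

# Keep only 'Abnormal' as the binary target: 1 = Abnormal, 0 = Normal
toy = pd.concat([toy.drop(columns="Class_att"), dummies[["Abnormal"]]], axis=1)
print(toy)
```

<p>Keeping a single indicator column is enough for a two-class problem, since the second column carries no extra information.</p>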

In [None]:
data.info()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<h1>Exploratory Data Analysis </h1>

In [None]:
data.columns = ['Pelvic Incidence','Pelvic Tilt','Lumbar Lordosis Angle','Sacral Slope','Pelvic Radius', 'Spondylolisthesis Degree', 'Pelvic Slope', 'Direct Tilt', 'Thoracic Slope', 'Cervical Tilt','Sacrum Angle', 'Scoliosis Slope','Outcome']

corr = data.corr()

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12,9))

# Draw the heatmap on that figure's axes using seaborn
sns.heatmap(corr, cmap='inferno', annot=True, ax=ax)

In [None]:
data.describe()

In [None]:
# Inspect the Spondylolisthesis Degree column for extreme values
outlier = data[["Spondylolisthesis Degree", "Outcome"]]
abspond = outlier[outlier["Spondylolisthesis Degree"] > 15]
print("1 = Abnormal, 0 = Normal\n", abspond["Outcome"].value_counts())

# Drop the outlier row (index label 115)
data = data.drop(index=115)

# Scatter plot colored by Outcome: green for 0 (Normal), red for 1 (Abnormal)
colors = data["Outcome"].astype(int).map({0: "green", 1: "red"})
plt.scatter(data["Cervical Tilt"], data["Spondylolisthesis Degree"], c=colors)
plt.xlabel("Cervical Tilt")
plt.ylabel("Spondylolisthesis Degree")
plt.show()
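
<p>The cell above removes the outlier by its index label. As an alternative, a rule-based check can flag extreme values without hard-coding a row. The sketch below uses the common 1.5 × IQR rule on invented numbers; it is not the method used in this notebook, and the values are not from the dataset.</p>

```python
import pandas as pd

# Made-up readings with one extreme value, for illustration only
s = pd.Series(
    [2.0, 5.5, 1.2, 8.9, 3.3, 400.0, 4.1, 6.7],
    name="Spondylolisthesis Degree",
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows falling outside the fences are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```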

In [None]:
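#   Separate the features from the target label
#   (despite the names, 'training' holds the features and 'testing' holds the labels)
training = data.drop('Outcome', axis=1)
testing = data['Outcome']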

In [None]:
#   Import necessary ML packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#   Split into training/testing datasets using train_test_split, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(training, testing, test_size=0.33, random_state=22, stratify=testing)
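
<p>The <code>stratify</code> argument keeps the class proportions roughly equal across the two splits. A small sketch on synthetic labels (70 positives, 30 negatives, chosen only for illustration) shows the effect:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 positives and 30 negatives (illustrative only)
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 70 + [0] * 30)

# Same split settings as the notebook, applied to the synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=22, stratify=y
)

# With stratification, both shares should sit close to the overall share
print("positive share overall:", y.mean())
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```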

<h1> Convert DataFrame Objects to NumPy Arrays Before Modelling</h1>

In [None]:
import numpy as np

#   Convert the full feature matrix and label vector to numpy.ndarray
array_train = np.asarray(training)
array_test = np.asarray(testing)

#   Convert each train/test split from a pandas object into a numpy array
array_XTrain, array_XTest, array_ytrain, array_ytest = np.asarray(X_train), np.asarray(X_test), np.asarray(y_train), np.asarray(y_test)

<h1> Employing Support Vector Machine as a Classifier - 85% </h1>

In [None]:
#    Import Necessary Packages
from sklearn import svm
from sklearn.metrics import accuracy_score

#   Instantiate the classifier
clf = svm.SVC(kernel='linear')

#   Fit the model to the training data
clf.fit(array_XTrain, array_ytrain)

#   Generate a prediction and store it in 'pred'
pred = clf.predict(array_XTest)

#   Print the accuracy score as a percentage of correct predictions
svmscore = accuracy_score(array_ytest, pred)
print("Support Vector Machine accuracy: {:.1f}%".format(svmscore * 100))
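
<p>A single train/test split gives one accuracy figure, which can vary with the random seed. The sketch below shows one possible way to get a steadier estimate: a linear SVM with feature scaling, scored by 5-fold cross-validation. It runs on synthetic data from <code>make_classification</code>, not on the spine dataset, and the scaling step is an assumption added here rather than part of the original notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 12 features (not the real dataset)
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=22
)

# Scaling lives inside the pipeline, so each fold is scaled using
# statistics from its own training portion only
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# One accuracy score per fold
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```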


<h1> Employing Logistic Regression as a Classifier - 82% </h1>

In [None]:
estimators = [('clf', LogisticRegression())]

pl = Pipeline(estimators)

pl.fit(X_train, y_train)

accuracy = pl.score(X_test, y_test)
print("\nAccuracy on test data", accuracy)

In [None]:
ypred = pl.predict(X_test)

In [None]:
report = classification_report(y_test, ypred)
print(report)
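
<p>The precision and recall figures in the classification report come from the counts in a confusion matrix. A minimal sketch with hand-written labels (invented for illustration, not model output) shows how they relate:</p>

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration: 1 = Abnormal, 0 = Normal
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 0, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

# Precision: share of predicted positives that are correct
precision = tp / (tp + fp)
# Recall: share of actual positives that were found
recall = tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```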

<h1> That's it! </h1>
<p>~85% prediction accuracy with Support Vector Machines! To push the accuracy higher, feature engineering is a natural next step, for example creating new variables based on domain knowledge.</p>