# Lower Back Pain Syndrome

This dataset is provided by sammy123 on [Kaggle](https://www.kaggle.com/sammy123/lower-back-pain-symptoms-dataset).

## Context

Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Typical sources of low back pain include:
* The large nerve roots in the low back that go to the legs may be irritated.
* The smaller nerves that supply the low back may be irritated.
* The large paired lower back muscles (erector spinae) may be strained.
* The bones, ligaments or joints may be damaged.
* An intervertebral disc may be degenerating.

An irritation or problem with any of these structures can cause lower back pain and/or pain that radiates or is referred to other parts of the body. Many lower back problems can also cause back muscle spasms, which do not sound like much but can cause severe pain and disability.

While lower back pain is extremely common, the symptoms and severity of lower back pain vary greatly. A simple lower back muscle strain might be excruciating enough to necessitate an emergency room visit, while a degenerating disc might cause only mild, intermittent discomfort.

## Question

How can we identify whether a person is abnormal or normal using the collected physical spine measurements?

In [None]:
# Load the libraries
get_ipython().run_line_magic('matplotlib', 'inline')
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [None]:
# Import the data
data = pd.read_csv('../input/Dataset_spine.csv', decimal='.', sep=',', header=0)
# Drop the unnamed extra column (positional `axis` is not accepted by recent pandas)
data = data.drop(columns='Unnamed: 13')
data.columns = ['pelvic_incidence', 'pelvic_tilt',
                'lumbar_lordosis_angle', 'sacral_slope',
                'pelvic_radius', 'degree_spondylolisthesis',
                'pelvic_slope', 'direct_tilt',
                'thoracic_slope', 'cervical_tilt',
                'sacrum_angle', 'scoliosis_slope',
                'class']

In [None]:
data.head()

## Exploration of the data

* Let's check if there are some missing values in this dataset.

In [None]:
data.info()

This dataset contains no missing values.
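
`data.info()` reports the non-null count of each column. An explicit count of missing values per column can be easier to read. Here is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the spine dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({'pelvic_tilt': [12.1, np.nan, 9.8],
                    'sacral_slope': [40.5, 38.2, 41.0]})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to `data`, the same call should return zero for every column if the dataset is complete.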

* Compute some basic statistics about the data.

In [None]:
data.describe()

Nothing looks unusual except the maximum of *Degree Spondylolisthesis*. An angle in degrees usually lies between -180° and 180° (or between 0° and 360°). Judging by the other variables, the angles here seem to be coded between -180° and 180° (with very few negative values). Let's look at all the values outside this range (only the variable *Degree Spondylolisthesis* is affected).


In [None]:
data[data.degree_spondylolisthesis > 180]

Only one observation has a *Degree Spondylolisthesis* larger than 180. We assume that the decimal point was misplaced in this value, so we replace 418.543082 with 41.8543082.

In [None]:
data.loc[115, 'degree_spondylolisthesis'] = 41.8543082
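
The same correction can be written without a hard-coded row label, by selecting the rows with a boolean mask. This is only a sketch: the frame `toy` is hypothetical (only the value 418.543082 comes from the text above), and dividing by 10 encodes the assumption that the decimal point is off by one place.

```python
import pandas as pd

# Hypothetical values; only 418.543082 is taken from the dataset description.
toy = pd.DataFrame({'degree_spondylolisthesis': [-0.25, 37.4, 418.543082, 12.9]})

# Flag the values outside the assumed [-180, 180] range.
out_of_range = ~toy['degree_spondylolisthesis'].between(-180, 180)
print(toy[out_of_range])

# Apply the assumed fix: shift the decimal point by one place.
toy.loc[out_of_range, 'degree_spondylolisthesis'] /= 10
```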

* Recode the variable *class* into a dummy variable (0: Abnormal, 1: Normal).

In [None]:
# Explicit integer coding: 1 for 'Normal', 0 for 'Abnormal'
data['class'] = (data['class'] == 'Normal').astype(int)
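
The coding can be checked on a small example. The sketch below assumes the raw labels are the strings `'Abnormal'` and `'Normal'`; the series `labels` is made up for illustration.

```python
import pandas as pd

# Hypothetical labels, assumed to match the spelling used in the dataset.
labels = pd.Series(['Abnormal', 'Normal', 'Abnormal', 'Normal'], name='class')

# drop_first=True drops the first category in sorted order ('Abnormal'),
# which leaves a single indicator column for 'Normal'.
dummies = pd.get_dummies(labels, prefix='class', drop_first=True)
print(list(dummies.columns))

# An explicit mapping that makes the coding visible.
explicit = labels.map({'Abnormal': 0, 'Normal': 1})
```

Both approaches give the same 0/1 coding; the explicit mapping simply documents which label becomes 1.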

* Then, we look at the correlation between the different variables.

In [None]:
# Compute the correlation matrix.
corr_data = round(data.corr(),2)
corr_data.columns = ['Pelvic Incidence', 'Pelvic Tilt',
                'Lumbar Lordosis Angle', 'Sacral Slope',
                'Pelvic Radius', 'Degree Spondylolisthesis',
                'Pelvic Slope', 'Direct Tilt',
                'Thoracic Slope', 'Cervical Tilt',
                'Sacrum Angle', 'Scoliosis Slope',
                'Class']
corr_data.index = corr_data.columns

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_data, mask=None, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot=True)
plt.show()


So, it appears that the class {Abnormal, Normal} is negatively correlated with the *Pelvic Incidence*, the *Pelvic Tilt*, the *Lumbar Lordosis Angle*, the *Sacral Slope* and the *Degree Spondylolisthesis* and positively correlated with the *Pelvic Radius*. The class has a very small correlation with the other variables.
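
Because the class is coded 0 (Abnormal) and 1 (Normal), the sign of a correlation with the class says which group has the larger mean. A negative correlation means the variable tends to be lower for Normal cases. A minimal sketch with made-up numbers (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: the feature is lower for class 1 than for class 0.
toy = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0],
                    'class': [1, 1, 0, 0]})

# Pearson correlation between the feature and the binary class.
r = toy['feature'].corr(toy['class'])

# Mean of the feature within each class.
group_means = toy.groupby('class')['feature'].mean()
print(r)
print(group_means)
```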

* Let's look at some boxplots for these variables.

In [None]:
# Create an empty figure; the six subplots are added one by one below
plt.figure(figsize=(30, 12))

plt.subplot(161)
sns.boxplot(y='pelvic_incidence', x='class', data=data)
plt.ylabel('Pelvic Incidence')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.subplot(162)
sns.boxplot(y='pelvic_tilt', x='class', data=data)
plt.ylabel('Pelvic Tilt')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.subplot(163)
sns.boxplot(y='lumbar_lordosis_angle', x='class', data=data)
plt.ylabel('Lumbar Lordosis Angle')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.subplot(164)
sns.boxplot(y='sacral_slope', x='class', data=data)
plt.ylabel('Sacral Slope')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.subplot(165)
sns.boxplot(y='degree_spondylolisthesis', x='class', data=data)
plt.ylabel('Degree Spondylolisthesis')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.subplot(166)
sns.boxplot(y='pelvic_radius', x='class', data=data)
plt.ylabel('Pelvic Radius')
plt.xlabel('')
plt.xticks(np.arange(2), ('Abnormal', 'Normal'))

plt.show()

## Feature subset selection

In [None]:
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(data.drop('class', axis=1, inplace=False), data['class'])

importances = model.feature_importances_
importances_std = np.std([model_tree.feature_importances_ for model_tree in model.estimators_], axis=0)

In [None]:
res = {'Name':['Pelvic Incidence', 'Pelvic Tilt',
                'Lumbar Lordosis Angle', 'Sacral Slope',
                'Pelvic Radius', 'Degree Spondylolisthesis',
                'Pelvic Slope', 'Direct Tilt',
                'Thoracic Slope', 'Cervical Tilt',
                'Sacrum Angle', 'Scoliosis Slope'],
       'Importances':importances,
       'Importances_std':importances_std}
res = pd.DataFrame(res)
res = res.sort_values('Importances')

plt.barh(y=range(res.shape[0]), width=res.Importances,
         xerr=res.Importances_std, align='center', tick_label=res.Name)
plt.xlabel('Variable importance')
plt.show()

So, we have an importance score for each attribute: the larger the score, the more important the attribute. As the correlation plot shows, *Degree Spondylolisthesis* is strongly correlated with *Pelvic Radius*, *Pelvic Tilt*, *Pelvic Incidence* and *Lumbar Lordosis Angle*. We will build the model using only *Degree Spondylolisthesis*, *Pelvic Radius*, *Pelvic Tilt* and *Pelvic Incidence* (the four variables with the highest importance).
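
Picking the top variables can also be done programmatically rather than by reading the bar chart. The sketch below uses made-up names and scores (`toy_res` is hypothetical) and keeps the two highest scores; with the real `res` frame, the same call with `4` would select four variables.

```python
import pandas as pd

# Hypothetical importance scores, for illustration only.
toy_res = pd.DataFrame({'Name': ['a', 'b', 'c', 'd', 'e'],
                        'Importances': [0.05, 0.30, 0.10, 0.40, 0.15]})

# Keep the names of the k variables with the largest importance.
top = toy_res.nlargest(2, 'Importances')['Name'].tolist()
print(top)
```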

* Let's plot two of these variables against each other, colored by class.

In [None]:
plt.scatter(data['degree_spondylolisthesis'], data['pelvic_radius'], c=data['class'])
plt.xlabel('Degree Spondylolisthesis')
plt.ylabel('Pelvic Radius')
plt.show()

## Model construction

* Split the dataset into training and test sets, then standardize the features.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(data[['degree_spondylolisthesis', 'pelvic_radius', 'pelvic_tilt', 'pelvic_incidence']], data['class'], test_size=1/3, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
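
The scaler is fitted on the training set only, so the test set is transformed with the training mean and standard deviation. A minimal sketch with made-up arrays (`X_tr` and `X_te` are hypothetical) shows the equivalent manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature training and test sets.
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[4.0]])

# Fit on the training data only, then transform both sets.
sc = StandardScaler().fit(X_tr)
X_tr_scaled = sc.transform(X_tr)
X_te_scaled = sc.transform(X_te)

# The test set is centered and scaled with the *training* statistics.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(X_te_scaled, manual)
```

Fitting the scaler on the full dataset instead would let information from the test set leak into the preprocessing.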

* Let's build a baseline that always predicts the most frequent class in the training set, so that we have something to compare our model against.

In [None]:
dummy = DummyClassifier(strategy='most_frequent', random_state=42)
dummy.fit(X_train_transformed, Y_train)
Y_pred_dummy = dummy.predict(X_test_transformed)

Y_pred_proba_dummy = dummy.predict_proba(X_test_transformed)[:, 1]
[fpr_dummy, tpr_dummy, thr_dummy] = metrics.roc_curve(Y_test, Y_pred_proba_dummy)

print("The accuracy for the dummy classifier is: %0.2f " % (metrics.accuracy_score(Y_test, Y_pred_dummy)))
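
With the `most_frequent` strategy, the baseline accuracy is simply the share of test observations that belong to the majority class of the training set. A minimal sketch with made-up labels (all arrays below are hypothetical):

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class 0 is the majority in the training set.
y_tr = np.array([0, 0, 0, 1])
y_te = np.array([0, 1, 1, 0, 0])

# The features are ignored by the dummy classifier, so zeros are enough.
X_tr = np.zeros((4, 1))
X_te = np.zeros((5, 1))

base = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
acc = metrics.accuracy_score(y_te, base.predict(X_te))

# Share of the training-majority class (0) in the test set.
share = np.mean(y_te == 0)
print(acc, share)
```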

* Use Logistic Regression to predict the class, tuning the hyperparameters by cross-validated grid search.

In [None]:
param_log_reg = {'tol': np.logspace(-5, 1, 7),
                 'C': np.logspace(-3, 3, 7),
                 'penalty': ['l2']}

# The `iid` argument was removed from recent scikit-learn versions, so it is omitted
log_reg = GridSearchCV(LogisticRegression(solver='lbfgs'), param_log_reg, cv=10)
log_reg.fit(X_train_transformed, Y_train)

print("Best parameters set found on development set:", log_reg.best_params_)

In [None]:
Y_pred_log_reg = log_reg.predict(X_test_transformed)

Y_pred_proba_log_reg = log_reg.predict_proba(X_test_transformed)[:, 1]
[fpr_log_reg, tpr_log_reg, thr_log_reg] = metrics.roc_curve(Y_test, Y_pred_proba_log_reg)

print("The accuracy for the Logistic Regression classifier is: %0.2f " % (metrics.accuracy_score(Y_test, Y_pred_log_reg)))

* Plot the ROC curves of the baseline and the Logistic Regression model.

In [None]:
plt.figure(figsize=(18,8))

plt.plot(fpr_dummy, tpr_dummy, color='blue', lw=2, label='Dummy Classifier - AUC = %0.2f' % metrics.auc(fpr_dummy, tpr_dummy))
plt.plot(fpr_log_reg, tpr_log_reg, color='red', lw=2, label='Logistic Regression - AUC = %0.2f' % metrics.auc(fpr_log_reg, tpr_log_reg))

plt.legend(loc = 'lower right')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity', fontsize=14)
plt.ylabel('Sensitivity', fontsize=14)
plt.title('ROC curves', fontsize=18)
plt.show()
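
Each point of a ROC curve corresponds to one decision threshold on the predicted probabilities: the y-coordinate is the sensitivity and the x-coordinate is 1 - specificity at that threshold. A minimal sketch with made-up labels and scores (`y_true` and `scores` are hypothetical) computes one such point by hand:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Full ROC curve and the area under it.
fpr, tpr, thr = metrics.roc_curve(y_true, scores)
auc = metrics.auc(fpr, tpr)

# One point by hand: predict class 1 when the score is at least 0.35.
pred = scores >= 0.35
sensitivity = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
one_minus_specificity = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
print(auc, sensitivity, one_minus_specificity)
```

A classifier that always predicts the same class, like the dummy baseline, yields a diagonal ROC curve with an AUC of 0.5.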