# Introduction
Lower back pain is an annoyance that is familiar to many of us today. Whether it is an issue with the spine or a developing habit of bad posture, there are many potential causes. 

This notebook explores and extracts insight from the Lower Back Pain Symptoms Dataset. The dataset contains 12 features and 1 binary class representing whether or not the patient experienced lower back pain. The features are all measurements and parameters of the spine. **Therefore, the problem of interest is whether back pain can be predicted from measurements of spine parameters, for instance from a spine x-ray.**

Throughout this notebook, we go through some initial data exploration as well as some background research into the anatomy of our back. We then apply a variety of models to the data and evaluate the performance of each model.  

This notebook is a good tutorial for beginners looking to practice on a new dataset.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from IPython.display import Image,display

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Let us begin by loading the dataset into a Pandas dataframe and taking a peek at what it contains.

In [None]:
#Column names ~ 12 features, 1 binary class
cols = ['pelvic_incidence','pelvic_tilt','lumbar_lordosis_angle','sacral_slope','pelvic_radius',\
        'degree_spondylolisthesis','pelvic_slope','direct_tilt','thoracic_slope','cervical_tilt',\
        'sacrum_angle','scoliosis_slope','normality']

#Read in data using column names, remove last column
data = pd.read_csv('../input/Dataset_spine.csv',header=0,names=cols,usecols=range(0,13))
data.head()

In [None]:
#Initial data stats
print("Number of examples: ",data.shape[0])
print("Number of features: ",data.shape[1]-1)
data.describe()

Finally, let's check for any missing data. It looks like there are no missing values.

In [None]:
#Check for missing values 
data.isnull().sum()

# Understanding the Features
A deeper dive into the meaning of the features reveals some interesting and important information. All of these features represent different parameters of the back, broken down into the cervical, thoracic, lumbar, and sacral areas of the spine as depicted in the first image below [1]. Degree of spondylolisthesis and scoliosis slope are the only two features that correspond to spinal disorders. About half of the features (the pelvic and sacral ones) are parameters of the pelvis and sacrum area, which according to the literature has a significant impact on the rest of the spine.

1. pelvic_incidence (degrees) - pelvic parameter, sum of two complementary angles: pelvic tilt and sacral slope, fixed for a given patient [2]
2. pelvic_tilt (degrees) - pelvic parameter
3. lumbar_lordosis_angle (degrees) - curvature of spine (lordosis) in lumbar region, proportional to sacral slope
4. sacral_slope (degrees) - pelvic parameter
5. pelvic_radius -
6. degree_spondylolisthesis (degrees) - a spinal disorder in which a bone (vertebra) slips forward onto the bone below it, possibly slip angle, negative values might come from how this angle is measured
7. pelvic_slope -
8. direct_tilt -
9. thoracic_slope (degrees) - curvature of spine in thoracic region
10. cervical_tilt (degrees) - tilt in cervical region of spine
11. sacrum_angle -
12. scoliosis_slope (degrees) - possibly Cobb angle to measure degree of scoliosis
13. normality - the binary class (abnormal or normal)

Note: Some of the features did not yield any results in my search. While the names of these features are descriptive, I suspect that several of these parameters represent similar information.

<img src="https://d11q7g6vqo5ah4.cloudfront.net/areas-of-the-spine-w320.jpg" width='30%' ></img>
<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/aa0d743d5ea374086e83cd07e1cd0f55c3a97b37/2-Figure1-1.png" width='30%'></img>

One interesting piece of information that an initial search of the feature definitions produced is that pelvic incidence is just the sum of the pelvic tilt and sacral slope. Let's double check this.

In [None]:
#Make a copy of the original data for understanding features
data_f = data.copy()

#Verify PI = PT + SS
data_f['pelvic_tilt + sacral_slope'] = data_f['pelvic_tilt']+data_f['sacral_slope']
cols = ['pelvic_incidence','pelvic_tilt + sacral_slope']
data_f[cols].head()
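
Comparing the first few rows by eye only checks a handful of patients. A minimal sketch of a numerical check over every row is shown below. It uses a small hypothetical dataframe `toy` as a stand-in so that it runs on its own; in the notebook the same two expressions would be applied to `data` instead. The tolerance `atol=1e-6` is an assumption and may need loosening if the dataset values are rounded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the real `data` dataframe instead.
toy = pd.DataFrame({
    'pelvic_tilt':  [22.5, 10.0, 15.25],
    'sacral_slope': [40.5, 29.0, 50.75],
})
toy['pelvic_incidence'] = toy['pelvic_tilt'] + toy['sacral_slope']

# Largest absolute gap between PI and PT + SS across all rows
pt_plus_ss = toy['pelvic_tilt'] + toy['sacral_slope']
max_gap = (toy['pelvic_incidence'] - pt_plus_ss).abs().max()

# True when every row satisfies PI = PT + SS within the chosen tolerance
identity_holds = np.allclose(toy['pelvic_incidence'], pt_plus_ss, atol=1e-6)

print("Largest gap:", max_gap)
print("PI = PT + SS for all rows:", identity_holds)
```

Reporting the largest gap alongside the boolean makes it easier to tell a rounding difference from a genuine violation of the identity.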

# Data Exploration
Now that we have a better understanding of what the features mean, we move onto exploring relationships within the data.

First, we look at the class variable. We see that there are about twice as many abnormal examples as normal ones. While it is not a perfect balance, at least the dataset is not severely unbalanced.

In [None]:
#Distribution for binary class variable 
print(data['normality'].value_counts())
sns.countplot(x="normality",data=data)

Next, we take a look at the correlation between the features visualized in a heatmap. The top left corner is very interesting at first glance. However, much of this is expected because we have already found out that pelvic incidence, pelvic tilt, and sacral slope are related. Based on literature, the pelvic and sacral parameters also have a significant impact on the spinal parameters [2]. 

It is interesting that only the top left corner is heavily correlated. I originally hypothesized that most of the features were measuring very similar parameters. However, this correlation heatmap would suggest otherwise. These uncorrelated features could be very important in distinguishing back pain because of the unique information they provide.

In [None]:
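#Visualize heatmap of correlations between the numeric features
#(the string-valued 'normality' column is dropped first, since recent pandas versions raise an error on it)
corr_mtx = data.drop(columns='normality').corr()
fig, ax = plt.subplots()
sns.heatmap(corr_mtx, square=True, ax=ax)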
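
A heatmap shows the overall pattern, but it can help to list the strongest pairs explicitly. The sketch below ranks feature pairs by absolute correlation. It runs on synthetic data with made-up column names (`feat_a`, `feat_b`, `feat_sum`, `feat_noise`), where `feat_sum` is built as the sum of two others to mimic the pelvic incidence relationship; none of these names come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical features: feat_sum is correlated with feat_a and feat_b by construction
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({
    'feat_a': a,
    'feat_b': b,
    'feat_sum': a + b,
    'feat_noise': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head(3))
```

Applied to the numeric columns of the real dataframe, the same three lines after `toy.corr()` would give a ranked list to read alongside the heatmap.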

Let us now zoom into the top left corner and visualize how these features are correlated.

Inspecting the pairplot, a few outliers are evident.

In [None]:
#Pair plot showing pairwise relationships between features that are somewhat correlated from heatmap 
use_cols = ['pelvic_incidence','pelvic_tilt','lumbar_lordosis_angle','sacral_slope','pelvic_radius',\
            'degree_spondylolisthesis','normality']
sns.pairplot(data[use_cols],height=2,hue='normality')  #'size' was renamed to 'height' in seaborn 0.9
plt.show()
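
Outliers spotted by eye can also be flagged programmatically. The sketch below applies the common 1.5 × IQR rule to a single column. This rule is one convention among several and is not something the notebook itself uses; the values are made up for illustration, with one deliberately extreme entry.

```python
import pandas as pd

# Hypothetical values for one feature; the last entry is deliberately extreme
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 400.0],
                   name='degree_spondylolisthesis')

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking values outside the fences
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])
```

Whether flagged points should be removed is a separate decision: an extreme spondylolisthesis value may well be a genuine abnormal case rather than a measurement error.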

# Baseline Models
Now that we have a feel for the dataset and some initial insight, let's go ahead and use the raw data to create a few baseline models. We will start by just looking at accuracy as the evaluation metric. Namely, we will experiment with the following models:

1. K-Nearest-Neighbor
2. Naive Bayes
3. Logistic Regression
4. Support Vector Classifier
5. Random Forest Classifier
    
First, let's make a copy of our original data and replace the categorical class variable with a binary number. Abnormal will be 1 and normal will be 0.

In [None]:
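#Convert categorical class to numerical binary
X = data.copy()
#One-hot encode the class: creates one indicator column per class label
X = pd.get_dummies(X)
#Drop the last indicator column (normal) and keep the remaining one (abnormal) as the binary target
X.drop(X.columns[-1], axis=1, inplace=True)
X.rename(columns = {X.columns[-1]:'normality'},inplace=True)
X.head()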
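
An explicit mapping is an alternative way to encode the class that makes the 1/0 assignment visible in the code. The sketch below assumes the class labels are spelled exactly `'Abnormal'` and `'Normal'`; that spelling is an assumption and should be checked against `data['normality'].unique()` before use. The dataframe `toy` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; label spellings are an assumption
toy = pd.DataFrame({
    'pelvic_tilt': [22.5, 10.0, 15.2],
    'normality': ['Abnormal', 'Normal', 'Abnormal'],
})

# Map each label to its binary code: abnormal -> 1, normal -> 0
encoded = toy.copy()
encoded['normality'] = encoded['normality'].map({'Abnormal': 1, 'Normal': 0})

print(encoded)
```

Any label that is missing from the dictionary becomes `NaN` under `map`, so a misspelled label shows up quickly in a missing-value check.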

In addition, we split the data into a training and test set using a 60/40 split and visualize the distribution of the resulting split.

In [None]:
#Split into train/test sets
X_train,X_test,y_train,y_test = train_test_split(X.drop(X.columns[-1], axis=1),X['normality'],test_size=0.4,random_state=0)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
print('y_train: ',y_train.shape)
print('y_test: ',y_test.shape)

fig, ax =plt.subplots(1,2)
sns.countplot(x=y_train,ax=ax[0])  #pass the labels as x; recent seaborn versions expect keyword arguments
ax[0].set_title('y_train')
sns.countplot(x=y_test,ax=ax[1])
ax[1].set_title('y_test')
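
With a class ratio of roughly 2:1, a purely random split can leave the train and test sets with slightly different class proportions. One option, sketched below on synthetic labels, is the `stratify` argument of `train_test_split`, which keeps the proportions the same in both sets. The arrays `features` and `labels` are made up for illustration; the notebook's own split does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 2:1 class ratio (200 positives, 100 negatives)
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
labels = np.array([1] * 200 + [0] * 100)

# stratify=labels preserves the class ratio in both the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.4, random_state=0, stratify=labels)

train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print("Share of class 1 in train:", train_ratio)
print("Share of class 1 in test: ", test_ratio)
```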

Now, on to fitting models.

### K-NN

In [None]:
#Fit Nearest Neighbor model
model = KNeighborsClassifier(n_neighbors=3)
clf = model.fit(X_train,y_train)

train_score = clf.score(X_train, y_train)
test_score  = clf.score(X_test, y_test)

print("Nearest Neighbor Model: ")
print ("Training Score: {}\nTest Score: {}" .format(train_score, test_score))

### Naive Bayes

In [None]:
#Fit Naive Bayes model
model = GaussianNB()
clf = model.fit(X_train,y_train)

train_score = clf.score(X_train, y_train)
test_score  = clf.score(X_test, y_test)

print("Naive Bayes Model: ")
print ("Training Score: {}\nTest Score: {}" .format(train_score, test_score))

### Logistic Regression

In [None]:
#Fit logistic regression model 
model = LogisticRegression(random_state=0)
clf = model.fit(X_train,y_train)

train_score = clf.score(X_train, y_train)
test_score  = clf.score(X_test, y_test)

print("Logistic Regression Model: ")
print ("Training Score: {}\nTest Score: {}" .format(train_score, test_score))

### SVM

In [None]:
#Fit linear SVM model - does not do as well, possibly because SVMs tend to do better when there are a lot of features
model = LinearSVC(random_state=0)
clf = model.fit(X_train,y_train)

train_score = clf.score(X_train, y_train)
test_score  = clf.score(X_test, y_test)

print("SVM Model:")
print ("Training Score: {}\nTest Score: {}" .format(train_score, test_score))

### Random Forest

In [None]:
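#Fit Random Forest model (a classifier, so that score() reports accuracy like the other models)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=5,n_estimators=30,random_state=0)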
clf = model.fit(X_train,y_train)

train_score = clf.score(X_train, y_train)
test_score  = clf.score(X_test, y_test)

print("Random Forest Model:")
print ("Training Score: {}\nTest Score: {}" .format(train_score, test_score))

### Model Performance
Overall, the K-NN with K = 3 results in the best performance with a test accuracy of ~87%. Not bad for an initial model with raw data. Logistic Regression also provides decent results with a test accuracy of ~81%.

Naive Bayes is somewhat unreliable here because it assumes that the features are independent, whereas about half of these features are highly correlated. Nevertheless, we are able to achieve a test accuracy of ~75%. Similarly, our SVC model achieves a test accuracy of ~77%. However, SVMs tend to perform better when there are many features.

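Finally, the random forest seems to perform the worst, with a test score of ~46% against a training score of ~88%, a gap that suggests the model is overfitting to the data. One caveat: these two figures were obtained with `RandomForestRegressor`, whose `score` method returns R² rather than accuracy, so they are not directly comparable with the accuracies of the other models.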
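
When comparing a random forest with the other models, it matters which `score` is being reported: in scikit-learn, a regressor's `score` returns R², while a classifier's `score` returns accuracy. The sketch below shows the difference on synthetic data generated with `make_classification` (not the spine dataset), reusing the same `max_depth` and `n_estimators` as above. The 0.5 threshold used to turn regressor outputs into class predictions is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only to illustrate the two metrics
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

reg = RandomForestRegressor(max_depth=5, n_estimators=30, random_state=0).fit(X_tr, y_tr)
clf = RandomForestClassifier(max_depth=5, n_estimators=30, random_state=0).fit(X_tr, y_tr)

# Regressor.score is R^2 on the 0/1 targets; Classifier.score is accuracy
reg_score = reg.score(X_te, y_te)
clf_score = clf.score(X_te, y_te)

# Thresholding the regressor's continuous output at 0.5 gives class predictions
reg_as_accuracy = accuracy_score(y_te, (reg.predict(X_te) >= 0.5).astype(int))

print("Regressor score (R^2):        ", reg_score)
print("Regressor thresholded accuracy:", reg_as_accuracy)
print("Classifier score (accuracy):   ", clf_score)
```

R² on binary targets is typically much lower than the accuracy of the same predictions, so a low regressor score does not by itself mean the model classifies poorly.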

### What's next:
1. Feature engineering. Feature selection? http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
2. Outlier removal?
3. What are the most important variables? 
4. Scoliosis Regression
5. Model tuning & selection
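
As a possible starting point for model tuning and selection, the sketch below combines feature scaling with K-NN in a pipeline and evaluates it with 5-fold cross-validation, which gives a more stable estimate than a single train/test split. It runs on synthetic data from `make_classification` with 12 features and roughly a 1:2 class ratio; the data, the choice of `StandardScaler`, and the fold count are all assumptions for illustration, not results from the spine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 12 features, about one third of the samples in class 0
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=6,
                                     weights=[0.33, 0.67], random_state=0)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage into the test fold
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Accuracy on each of the 5 folds
cv_scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring='accuracy')

print("Fold accuracies:", cv_scores)
print("Mean accuracy: {:.3f} (std {:.3f})".format(cv_scores.mean(), cv_scores.std()))
```

Scaling matters for distance-based models such as K-NN because features measured on larger scales would otherwise dominate the distance computation.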

# Resources
[1] Spine Anatomy - https://managebackpain.com/articles/spine-anatomy

[2] Pelvic parameters: origin and significance - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175921/