# LOWER BACK PAIN DATASET

### Introduction:

Back pain affects many people worldwide, and it is often linked to certain spinal anomalies.

A much larger dataset would be needed for these models to be considered reliable, but this one is well suited to practicing and having some fun with model building.
Also, despite the title "Lower back pain symptoms", the dataset consists of spine spatial layout values, so the goal of this project is to classify as accurately as possible whether an individual is labeled as having a 'Normal' or 'Abnormal' spine.

Type of Problem: Binary Classification

Models: Logistic Regression, SVC, Neural Network

GridSearch and K-Fold cross-validation were used to tune and evaluate the models.

Any feedback is much appreciated :) thank you!

## Part 1 - Data Exploration

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

import seaborn as sns
sns.set_style("darkgrid")
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.palplot(sns.color_palette(flatui))

In [None]:
df = pd.read_csv('../input/Dataset_spine.csv', na_filter = True, skip_blank_lines = True)

# Naming Columns
df.columns = ['Pelvic Incidence', 'Pelvic Tilt', 'Lumbar Lordosis Angle', 'Sacral Slope', 'Pelvic Radius',
              'Degree Spondylolisthesis', 'Pelvic Slope', 'Direct Tilt', 'Thoracic Slope', 'Cervical Tilt',
              'Sacral Angle', 'Scoliosis Slope','Target', '13']
df.drop('13', axis = 1, inplace = True)


### Plotting all the features for Normal and Abnormal classified individuals

The violin plots below show that the distribution of values differs between the 'Normal' and 'Abnormal' classes for several features.

In [None]:
fig, ax = plt.subplots(figsize=(15,8), ncols=4, nrows=3)

left   =  0.125  # the left side of the subplots of the figure
right  =  0.9    # the right side of the subplots of the figure
bottom =  0.1    # the bottom of the subplots of the figure
top    =  0.9    # the top of the subplots of the figure
wspace =  .5     # the amount of width reserved for blank space between subplots
hspace =  1.1    # the amount of height reserved for white space between subplots

plt.subplots_adjust(
    left    =  left, 
    bottom  =  bottom, 
    right   =  right, 
    top     =  top, 
    wspace  =  wspace, 
    hspace  =  hspace
)

plt.suptitle("Distribution of Values - Normal vs Abnormal", y = 1.09, fontsize=15)

# One violin plot per feature, filling the 3x4 grid row by row
features = [col for col in df.columns if col != 'Target']
for feature, axis in zip(features, ax.flat):
    sns.violinplot(x = 'Target', y = feature, data = df, ax = axis, palette = flatui)
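
The visual comparison can be backed up with a few numbers per class. The sketch below is only an illustration: it uses a small synthetic frame (the `demo` name and all values are made up, not taken from the spine dataset) and assumes the target column holds the strings 'Abnormal' and 'Normal'. On the real data, the same `groupby` call on `df` should give the equivalent table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the spine data; every value here is invented.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Pelvic Incidence": np.concatenate([rng.normal(65, 15, 50),
                                        rng.normal(52, 10, 50)]),
    "Degree Spondylolisthesis": np.concatenate([rng.normal(38, 20, 50),
                                                rng.normal(2, 5, 50)]),
    "Target": ["Abnormal"] * 50 + ["Normal"] * 50,
})

# Mean, median and standard deviation of each feature, per class.
summary = demo.groupby("Target").agg(["mean", "median", "std"])
print(summary.round(2))
```

Features whose per-class means differ by more than their spread are the ones where the violins look most different.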

## Preparing Features and Target, train_test_splitting

In [None]:
# y
y = df['Target']
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
y = label.fit_transform(y)


# X
dfx = df.drop(['Target'], axis = 1)
X = dfx

# Splitting into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
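
It helps to know which integer each class ends up with, since the later models only see 0 and 1. `LabelEncoder` sorts the class names, so, assuming the target column contains the strings 'Abnormal' and 'Normal', the mapping can be checked like this (the `labels` list is a toy example, not the real column):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Abnormal", "Normal", "Abnormal", "Normal", "Normal"]  # toy labels
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each class is encoded as its index in that array.
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)           # {'Abnormal': 0, 'Normal': 1}
print(encoded.tolist())  # [0, 1, 0, 1, 1]
```

Under that assumption, a prediction of 1 means 'Normal' and 0 means 'Abnormal'.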

## Part 2  -  Feature Selection - Extraction

This dataset has a lot of features, so let's see if we can find the best ones and cut some off.
We will build a FeatureUnion, grid-search it with a simple logistic regression model, and then see which features prove most valuable.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# FEATURE SELECTION
selection = SelectKBest(k=1)

#FEATURE EXTRACTION
pca = PCA(n_components = 2)
k_pca = KernelPCA(n_components = 2)

# FEATURE UNION (FEATURE SELECTION + FEATURE EXTRACTION)
estimators = [('sel', selection),
              ('pca', pca),
              ('k_pca', k_pca)]  

combined = FeatureUnion(estimators)

X_features = combined.fit(X, y).transform(X)
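
A FeatureUnion simply runs each transformer on the same input and stacks the outputs side by side. A minimal sketch on random data (the `X_demo`, `y_demo` and `union` names are made up for this illustration) shows that 1 selected feature, 2 PCA components and 2 kernel-PCA components give 5 columns in total:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 12))    # 60 rows, 12 features (synthetic)
y_demo = rng.integers(0, 2, size=60)  # random binary labels

union = FeatureUnion([
    ("sel", SelectKBest(k=1)),
    ("pca", PCA(n_components=2)),
    ("k_pca", KernelPCA(n_components=2)),
])

# Each transformer is fit on the same input; outputs are concatenated column-wise.
X_union = union.fit_transform(X_demo, y_demo)
print(X_union.shape)  # (60, 5)
```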

## Part 3 -  Logistic Regression Pipeline

In [None]:
log_regression = LogisticRegression()
pipeline = Pipeline([("features", combined), ("log", log_regression)])

### 3.1. GridSearch

In [None]:
components = [1,2,3,4,5]
original = [1,2,3,4]
Cs = np.logspace(-4, 4, 3)
param_grid = dict(features__pca__n_components=components,
                  features__k_pca__n_components=components,
                  features__sel__k=original,
                  log__C=Cs)

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
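
Before running a search like this, it is worth counting how many candidates it contains. The sketch below rebuilds the same grid and counts the combinations with `ParameterGrid`. Each candidate is then fit once per fold (5 folds by default in scikit-learn 0.22 and later, 3 in earlier versions).

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = dict(features__pca__n_components=[1, 2, 3, 4, 5],
                  features__k_pca__n_components=[1, 2, 3, 4, 5],
                  features__sel__k=[1, 2, 3, 4],
                  log__C=np.logspace(-4, 4, 3))

# 5 * 5 * 4 * 3 parameter combinations
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 300
```

The double underscores in the keys address nested steps: `features__pca__n_components` means the `n_components` parameter of the `pca` transformer inside the `features` step of the pipeline.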

In [None]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

## Part 4 - SVC, already with previously obtained new features.

In the former model, the accuracy obtained was 73.9%, and the best feature parameters were 1 kernel PCA component, 5 PCA components and 1 selected feature.

In [None]:
# FEATURE SELECTION
selection = SelectKBest(k=3)

#FEATURE EXTRACTION
pca = PCA(n_components = 5)
k_pca = KernelPCA(n_components = 1)

# FEATURE UNION (FEATURE SELECTION + FEATURE EXTRACTION)
estimators = [('sel', selection),
              ('pca', pca),
              ('k_pca', k_pca)]  
combined = FeatureUnion(estimators)

X_features = combined.fit(X, y).transform(X)

In [None]:
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf', random_state = 0)

pipeline2 = Pipeline([("features", combined), ("svc", svc)])

In [None]:
Cs = np.logspace(-4, 4, 3)
kernels = ['rbf','poly']
param_grid = dict(svc__C=Cs,
                 svc__kernel=kernels)

grid_search = GridSearchCV(pipeline2, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)

In [None]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

So 69.4%... I guess we are better off with Logistic Regression on this one, then! Now let's k-fold it.
Remember, the optimal C value was 10000.
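
In scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so C = 10000 means very weak regularization. It is also the largest of only three values in the grid, which suggests the search may have stopped at the edge of the range. The sketch below prints the grid that was used and a finer, wider one. The `Cs_fine` grid is a hypothetical choice, not something that was run here.

```python
import numpy as np

# The grid used above: three values spanning eight orders of magnitude.
Cs = np.logspace(-4, 4, 3)
print(Cs)  # roughly 1e-4, 1 and 1e4

# A finer, wider grid (hypothetical choice): one value per power of ten.
Cs_fine = np.logspace(-4, 6, 11)
print(Cs_fine)
```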

## Part 5 - K-Folds to evaluate Pipeline

In [None]:
log_regression = LogisticRegression(C=10000)
final_model = Pipeline([("features", combined), ("log", log_regression )])

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = final_model, X = X_train, y = y_train, cv = 10)

avg_acc = accuracies.mean()
std_acc = accuracies.std()

print ("avg_acc: {} \nstd_acc: {}".format(avg_acc,std_acc))

With an average accuracy of 85%, this model is not too shabby.
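
One variant worth trying is to put the scaler inside the pipeline, so that it is refit on the training part of every fold and never sees the held-out fold. The sketch below is not the setup used above: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `demo_model`, the row count, and the `max_iter` value are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 rows and 12 features, chosen arbitrarily.
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     random_state=0)

# Scaling inside the pipeline: the scaler is refit on each training fold only.
demo_model = Pipeline([("scale", StandardScaler()),
                       ("log", LogisticRegression(C=10000, max_iter=1000))])

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv)
print("mean: {:.3f}  std: {:.3f}".format(scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough idea of how much the average accuracy could move with a different split.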

## Part 6 - Applying a Neural Network: Will it surpass our expectations?

Let's create a fully connected neural network and see how it performs.
Since this is a binary classification problem, the last layer uses a sigmoid activation function, so the output is a value between 0 and 1 that can be thresholded to assign the 'Normal' or 'Abnormal' label.
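
To make the role of the sigmoid concrete, here is a small NumPy sketch. The `logits` values are made up, and the 0.5 threshold matches the one applied to the predictions later on.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])  # made-up pre-activation values
probs = sigmoid(logits)

# Values above 0.5 are assigned to class 1, everything else to class 0.
classes = (probs > 0.5).astype(int)

print(probs.round(3))
print(classes)  # [0 0 0 1 1]
```

Which label corresponds to 1 depends on how the target was encoded.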

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
import keras.backend as K

K.clear_session()

model = Sequential()
model.add(Dense(24, input_dim = 12, activation='relu'))
model.add(Dense(6, activation ='relu'))
model.add(Dense(1, activation ='sigmoid'))
# Recent Keras releases name this argument learning_rate; older ones call it lr
model.compile(SGD(learning_rate=0.5), 'binary_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X_train, y_train, epochs = 1000)

In [None]:
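y_pred = model.predict(X_test)
# Threshold the predicted probabilities and flatten to a 1-D array of 0/1 labels
y_class_pred = (y_pred > 0.5).astype(int).ravel()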

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test,y_class_pred)
acc

It looks like the neural network didn't do as well as the simple logistic regression. In the future, it would be worth searching for better hyperparameters and a better architecture.