Lower Back Pain Symptoms Dataset
===================================


https://www.kaggle.com/sammy123/lower-back-pain-symptoms-dataset

Lower back pain can be caused by a variety of problems with any part of the complex,
interconnected network of spinal muscles, nerves, bones, discs or tendons in the lumbar spine. Typical sources of low back pain include:


- The large nerve roots in the low back that go to the legs may be irritated
- The smaller nerves that supply the low back may be irritated
- The large paired lower back muscles (erector spinae) may be strained
- The bones, ligaments or joints may be damaged
- An intervertebral disc may be degenerating

An irritation or problem with any of these structures can cause lower back pain and/or pain that radiates or is 
referred to other parts of the body. Many lower back problems also cause back muscle spasms, which don't sound like much but can cause severe pain and disability.

While lower back pain is extremely common, the symptoms and severity of lower back pain vary greatly. 
A simple lower back muscle strain might be excruciating enough to necessitate an emergency room visit, 
while a degenerating disc might cause only mild, intermittent discomfort.

The goal with this dataset is to classify a person as abnormal or normal from collected physical measurements of the spine.

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [None]:
# Importing the dataset
back_pain = pd.read_csv("/kaggle/input/lower-back-pain-symptoms-dataset/Dataset_spine.csv")
back_pain.head()

### Rename of the Columns in the Dataframe

In [None]:
back_pain = back_pain.rename(columns={"Col1":"pelvic_incidence","Col2":"pelvic tilt","Col3":"lumbar_lordosis_angle","Col4":"sacral_slope","Col5":"pelvic_radius","Col6":"degree_spondylolisthesis","Col7":"pelvic_slope","Col8":"Direct_tilt","Col9":"thoracic_slope","Col10":"cervical_tilt","Col11":"sacrum_angle","Col12":"scoliosis_slope","Class_att":"Abnormal_Normal"})


In [None]:
back_pain.head()

In [None]:
back_pain.dtypes

In [None]:
# Drop 'Unnamed: 13' as this is not in use
back_pain.drop(['Unnamed: 13'], axis = 1, inplace = True)

In [None]:
back_pain.head(3)

#### Converting the binary class label (Abnormal/Normal) to 1/0

In [None]:
# List of variables to map

varlist =  ['Abnormal_Normal']

# Defining the map function
def binary_map(x):
    return x.map({"Abnormal": 1, "Normal": 0})

# Applying the function to the variable list
back_pain[varlist] = back_pain[varlist].apply(binary_map)
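
One thing worth checking after a dictionary-based `map` is whether every label was actually covered, since pandas fills unmapped values with `NaN` rather than raising an error. The sketch below uses a small made-up Series (the misspelled `"abnormal"` is a hypothetical value, not something observed in this dataset) to show how such labels could be spotted.

```python
import pandas as pd

# Hypothetical labels; the lowercase "abnormal" is deliberately not in the mapping
labels = pd.Series(["Abnormal", "Normal", "abnormal"])

mapped = labels.map({"Abnormal": 1, "Normal": 0})

# Labels that the mapping did not cover end up as NaN
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

If the resulting list is empty, every label was converted to 0 or 1.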

In [None]:
back_pain.head(2)

### Label Encoding

In [None]:
# import preprocessing from sklearn
from sklearn import preprocessing

# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


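# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns;
# note that for a continuous column this replaces each value with its rank among the column's unique values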
back_pain_2 = back_pain.apply(le.fit_transform)
back_pain_2.head(10)
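
To make the effect of label encoding on a continuous column concrete, here is a minimal sketch on made-up numbers (not taken from the dataset). `LabelEncoder` maps each value to its position among the sorted unique values; the sketch reproduces that mapping with `np.unique` so it has no scikit-learn dependency.

```python
import numpy as np

# Hypothetical measurements, with one repeated value
values = np.array([63.03, 39.06, 68.83, 39.06])

# Sorted unique values and, for each entry, its index among them
classes, encoded = np.unique(values, return_inverse=True)

print(classes)
print(encoded)
```

The encoded column keeps the ordering of the original values but discards the distances between them, which is worth keeping in mind when interpreting the model coefficients later on.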

### Rescaling the Features 

We will use MinMax scaling.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to all feature columns (every column except the target 'Abnormal_Normal')
num_vars = ["pelvic_incidence", "pelvic tilt", "lumbar_lordosis_angle", "sacral_slope", "pelvic_radius", "degree_spondylolisthesis", "pelvic_slope", "Direct_tilt", "thoracic_slope", "cervical_tilt", "sacrum_angle", "scoliosis_slope"]

back_pain_2[num_vars] = scaler.fit_transform(back_pain_2[num_vars])

back_pain_2.head()

In [None]:
back_pain_2.isnull().sum()

In the dataset above, the min-max scaler has been used to put all feature values between `0` and `1`.
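
For reference, min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula to a small made-up column; the helper name `min_max_scale` is an illustration only, and returning zeros for a constant column is a choice made for this sketch.

```python
import numpy as np

def min_max_scale(column):
    """Scale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # Constant column: avoid dividing by zero (sketch-level choice)
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

scaled = min_max_scale([10.0, 15.0, 20.0, 30.0])
print(scaled)
```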

## Checking for Outliers 

In [None]:
# Checking for outliers in the continuous variables
num_back_pain_2 = back_pain_2[["pelvic_incidence","pelvic tilt","lumbar_lordosis_angle","sacral_slope","pelvic_radius","degree_spondylolisthesis","pelvic_slope","Direct_tilt","thoracic_slope","cervical_tilt","sacrum_angle","scoliosis_slope","Abnormal_Normal"]]

In [None]:
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%
num_back_pain_2.describe(percentiles=[.25, .5, .75, .90, .95, .99])
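
Reading the upper percentiles is one way to spot outliers. A common complement, which this notebook does not itself use, is the 1.5 x IQR rule. The sketch below is a minimal version of that rule on made-up values; the function name and the multiplier `k` are assumptions for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the interquartile range."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical column with one obviously extreme value at the end
sample = np.array([0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 5.0])
mask = iqr_outlier_mask(sample)
print(sample[mask])
```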

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("dark_background")

## Distribution of pelvic_incidence

In [None]:
# Distribution plot for pelvic_incidence
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement)
plt.figure(figsize=[9, 5])
sns.histplot(num_back_pain_2.pelvic_incidence, bins=40, color="orange", kde=True, stat="density")
plt.title("Distribution of pelvic_incidence", fontsize=20, fontweight=10, verticalalignment='baseline')

plt.show()

In [None]:
sns.boxplot(x=num_back_pain_2.cervical_tilt)

### Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
back_pain_2.head(3)

In [None]:
# Putting feature variable to X
X = back_pain_2.drop(['Abnormal_Normal'], axis=1)

X.head()

In [None]:
# Putting response variable to y
y = back_pain_2['Abnormal_Normal']

y.head()

In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

In [None]:
# Let's see the correlation matrix 
plt.style.use("ggplot")
plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(back_pain_2.corr(),annot = True,cmap="Greens")
plt.show()

### Model Building
The data has already been split into a training set and a test set, so we can now fit a logistic regression model on the training data.

#### Running Your First Training Model

In [None]:
import statsmodels.api as sm

In [None]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

### Feature Selection Using RFE

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=10)             # running RFE with 10 variables as output
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]

In [None]:
X_train.columns[~rfe.support_]

### Dropping the Variables Identified by RFE to Reduce Complexity

In [None]:
X_train = X_train.drop(['lumbar_lordosis_angle'], axis=1)

In [None]:
X_train = X_train.drop(['thoracic_slope'], axis=1)

#### Assessing the model with StatsModels

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

### Checking VIF

The Variance Inflation Factor (VIF) gives a basic quantitative idea of how strongly each feature variable is correlated with the others. It is a useful check for multicollinearity in a linear model. The formula for calculating `VIF` is given below, where $R_i^2$ is the $R^2$ obtained by regressing feature $i$ on all the other features:

### $ VIF_i = \frac{1}{1 - {R_i}^2} $
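
To connect the formula to something executable, here is a minimal NumPy sketch that computes the VIF by hand on synthetic data. The helper `vif_manual` and the three synthetic columns are assumptions for illustration. The sketch adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two can differ unless a constant column is included.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    ss_res = residuals @ residuals
    ss_tot = ((target - target.mean()) ** 2).sum()
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Synthetic features: c is almost a copy of a, b is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X_demo = np.column_stack([a, b, c])

vifs = [vif_manual(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

The two nearly collinear columns should come out with large VIFs, while the independent column stays close to 1.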

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['pelvic_incidence'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['pelvic_slope'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['Direct_tilt'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['cervical_tilt'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['scoliosis_slope'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
X_train = X_train.drop(['sacrum_angle'], axis=1)
col = X_train.columns
X_train = X_train[col]

In [None]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

In [None]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

#### Creating a dataframe with the actual abnormal/normal and the predicted probabilities

In [None]:
y_train_pred_final = pd.DataFrame({'Abnormal_Normal':y_train.values, 'Abnormal_Normal_Prob':y_train_pred})
y_train_pred_final.head()

##### Creating a new column 'predicted' with 1 if Abnormal_Normal_Prob > 0.5, else 0

In [None]:
y_train_pred_final['predicted'] = y_train_pred_final.Abnormal_Normal_Prob.map(lambda x: 1 if x > 0.5 else 0)

# Let's see the head
y_train_pred_final.head()

In [None]:
y_train_pred_final.predicted.value_counts()

In [None]:
from sklearn import metrics

In [None]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted )
print(confusion)

In [None]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted))

## Metrics beyond simply accuracy

In [None]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting abnormal when the person is actually normal
print(FP/ float(TN+FP))

In [None]:
# positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))
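
The same quantities can be gathered in one place. The sketch below is a small helper (the name `confusion_metrics` and the example counts are hypothetical) that assumes the scikit-learn layout used above, with rows as actual classes and columns as predicted classes, i.e. `[[TN, FP], [FN, TP]]`.

```python
def confusion_metrics(confusion):
    """Derive common rates from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }

# Hypothetical counts, not results from this notebook
example = [[50, 10], [5, 35]]
rates = confusion_metrics(example)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```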


## Plotting the ROC Curve
An ROC curve demonstrates several things:

- It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
- The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
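
The tradeoff described above can be seen on a toy example. The sketch below computes the false positive rate and true positive rate by hand at a few thresholds; the labels, probabilities, thresholds and the helper `roc_points` are all made up for illustration. A sample is predicted positive when its probability is at or above the threshold.

```python
import numpy as np

def roc_points(actual, probs, thresholds):
    """Return (threshold, FPR, TPR) for each threshold."""
    actual = np.asarray(actual)
    probs = np.asarray(probs, dtype=float)
    points = []
    for t in thresholds:
        predicted = (probs >= t).astype(int)
        tp = np.sum((predicted == 1) & (actual == 1))
        fp = np.sum((predicted == 1) & (actual == 0))
        fn = np.sum((predicted == 0) & (actual == 1))
        tn = np.sum((predicted == 0) & (actual == 0))
        points.append((t, float(fp / (fp + tn)), float(tp / (tp + fn))))
    return points

# Hypothetical labels and predicted probabilities
actual = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
points = roc_points(actual, probs, thresholds=[0.0, 0.3, 0.5, 0.9])
for t, fpr, tpr in points:
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold lowers the false positive rate (higher specificity) at the cost of a lower true positive rate (lower sensitivity).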

In [None]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [None]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Abnormal_Normal, y_train_pred_final.Abnormal_Normal_Prob, drop_intermediate = False )

In [None]:
draw_roc(y_train_pred_final.Abnormal_Normal, y_train_pred_final.Abnormal_Normal_Prob)

### Finding Optimal Cutoff Point

In [None]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i]= y_train_pred_final.Abnormal_Normal_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

In [None]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Abnormal_Normal, y_train_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

In [None]:
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

#### From the curve above, 0.7 is the optimum point to take it as a cutoff probability.
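
One way to read such a crossing point off a table rather than a plot is to pick the cutoff where sensitivity and specificity are closest. The sketch below does this on a small table of made-up numbers; these are hypothetical values for illustration only and are not the results of this notebook.

```python
# Each row: (prob, accuracy, sensi, speci); hypothetical values only
rows = [
    (0.2, 0.78, 0.96, 0.45),
    (0.4, 0.83, 0.90, 0.70),
    (0.6, 0.84, 0.84, 0.83),
    (0.8, 0.70, 0.60, 0.92),
]

# Choose the cutoff with the smallest gap between sensitivity and specificity
best = min(rows, key=lambda r: abs(r[2] - r[3]))
print("cutoff with the closest sensitivity and specificity:", best[0])
```

This is only one possible criterion; depending on the cost of false negatives versus false positives, a different cutoff may be preferable.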

In [None]:
y_train_pred_final['final_predicted'] = y_train_pred_final.Abnormal_Normal_Prob.map( lambda x: 1 if x > 0.7 else 0)

y_train_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_train_pred_final.Abnormal_Normal, y_train_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_train_pred_final.Abnormal_Normal, y_train_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

In [None]:
# Calculate false positive rate - predicting abnormal when the person is actually normal
print(FP/ float(TN+FP))

In [None]:
# Positive predictive value 
print (TP / float(TP+FP))

In [None]:
# Negative predictive value
print (TN / float(TN+ FN))

## Precision and Recall
## Looking at the confusion matrix again

In [None]:

confusion = metrics.confusion_matrix(y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted )
confusion

Precision
TP / (TP + FP)

In [None]:
confusion[1,1]/(confusion[0,1]+confusion[1,1])

Recall
TP / (TP + FN)

In [None]:
confusion[1,1]/(confusion[1,0]+confusion[1,1])
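
As a self-contained check of the two definitions, the sketch below counts true positives, false positives and false negatives directly from label lists. The helper `precision_recall` and the example labels are hypothetical, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall(actual, predicted):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN), for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 3 true positives, 2 false positives, 1 false negative
actual = [1, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 1, 1]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```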

### Using sklearn utilities for the same

In [None]:
from sklearn.metrics import precision_score, recall_score

In [None]:
precision_score(y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted)

In [None]:
recall_score(y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted)

## Precision and recall tradeoff

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
y_train_pred_final.Abnormal_Normal, y_train_pred_final.predicted

In [None]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Abnormal_Normal, y_train_pred_final.Abnormal_Normal_Prob)

In [None]:
plt.plot(thresholds, p[:-1], "g-", label="precision")
plt.plot(thresholds, r[:-1], "r-", label="recall")
plt.xlabel("threshold")
plt.legend()
plt.show()

###  Making predictions on the test set

In [None]:
X_test.columns

In [None]:
X_test.head(3)

In [None]:

# X_test comes from back_pain_2, whose feature columns were already min-max scaled before the split,
# so no further scaling is applied here (transforming a second time would distort the features).

In [None]:
X_test = X_test[col]
X_test.head()

In [None]:
X_test_sm = sm.add_constant(X_test)

In [None]:
y_test_pred = res.predict(X_test_sm)

In [None]:
y_test_pred[:10]

In [None]:
# Converting y_test_pred (a pandas Series) to a dataframe
y_pred_1 = pd.DataFrame(y_test_pred)

In [None]:
# Let's see the head
y_pred_1.head()

In [None]:
# Converting y_test to dataframe
y_test_df = pd.DataFrame(y_test)

In [None]:
# Putting the original row index into an 'ID' column
y_test_df['ID'] = y_test_df.index

In [None]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [None]:
# Appending y_test_df and y_pred_1
y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [None]:
y_pred_final.head()

In [None]:
# Renaming the column 
y_pred_final= y_pred_final.rename(columns={ 0 : 'Abnormal_Normal_Prob'})

In [None]:
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['ID','Abnormal_Normal','Abnormal_Normal_Prob'], axis=1)

In [None]:
# Let's see the head of y_pred_final
y_pred_final.head()

In [None]:
y_pred_final['final_predicted'] = y_pred_final.Abnormal_Normal_Prob.map(lambda x: 1 if x > 0.54 else 0)

In [None]:
y_pred_final.head()

In [None]:
# Let's check the overall accuracy.
metrics.accuracy_score(y_pred_final.Abnormal_Normal, y_pred_final.final_predicted)

In [None]:
confusion2 = metrics.confusion_matrix(y_pred_final.Abnormal_Normal, y_pred_final.final_predicted )
confusion2

In [None]:
TP = confusion2[1,1] # true positive 
TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [None]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)

In [None]:
# Let us calculate specificity
TN / float(TN+FP)

### Abnormal_Normal Final Test Predictions

In [None]:
y_pred_final.final_predicted.value_counts()

In [None]:
y_pred_final.head(6)